dhelp package

Submodules

dhelp.files module

dhelp.settings module

dhelp.text module

dhelp.web module

class dhelp.web.WebPage(url, options={})

Bases: collections.UserString

Downloads and parses HTML into BeautifulSoup objects.

Provides methods to download and parse a specified webpage. Combines the requests package with BeautifulSoup so that a page can be requested and parsed ("souped") in a single line.

Parameters:	url (`str`): address of the page to download
	options (`dict`, optional): settings such as delay, max_retries, silent, and parser

Examples

>>> from dhelp import WebPage
>>> web_page = WebPage('https://stackoverflow.com')
>>> print(web_page)
https://stackoverflow.com
>>> # pass a dict to set options such as delay, max_retries, silent, or parser
>>> options = {
...     'delay': 4,
...     'max_retries': 3,
...     'silent': True,
...     'parser': 'html.parser'
... }
>>> web_page = WebPage('https://stackoverflow.com', options=options)
>>> print(web_page)
https://stackoverflow.com
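
Because WebPage is based on collections.UserString, an instance behaves like the URL string it wraps, which is why printing it shows the URL. The following is a minimal sketch of that pattern, not the dhelp implementation: the class name `PageSketch` and the values in `DEFAULT_OPTIONS` are hypothetical, chosen only for illustration.

```python
from collections import UserString

# Hypothetical defaults for illustration; the real dhelp defaults
# are not documented in this section.
DEFAULT_OPTIONS = {
    'delay': 2,
    'max_retries': 0,
    'silent': False,
    'parser': 'html.parser',
}


class PageSketch(UserString):
    """Illustrative stand-in for WebPage (an assumption, not dhelp code)."""

    def __init__(self, url, options=None):
        # UserString keeps the wrapped string in self.data
        super().__init__(url)
        # options passed by the caller override the defaults
        self.options = {**DEFAULT_OPTIONS, **(options or {})}


page = PageSketch('https://stackoverflow.com', options={'delay': 4})
print(page)                      # https://stackoverflow.com
print(page.startswith('https'))  # True
print(page.options['delay'])     # 4
```

The sketch uses `options=None` in place of a `{}` default so that separate instances never share one mutable dict.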

fetch(retry_counter=0)

Returns the body of the HTTP response from the URL as a string.

Can be called to return raw HTML, although it is not generally meant to be called directly by the user. It is intended to be called by .soup() to feed its parser. If you do call .fetch() yourself, do not pass retry_counter, so that the count starts at 0.

If the request is not successful, .fetch() calls itself recursively until it either succeeds or reaches the maximum number of attempts. If the .max_retries property is set to 0, .fetch() keeps retrying indefinitely.

Parameters:	retry_counter (`int`)
Returns:	`str` HTML from requested URL, in plain text format

Examples

>>> html_text = WebPage('https://stackoverflow.com/').fetch()
>>> html_text
'<!DOCTYPE html>\r\n<html>\r\n\r\n    <head>\r\n\r\n        <title>Stack Overflow...'
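
The retry behaviour described for .fetch() can be sketched without any network access. This is a rough illustration under stated assumptions, not the library's code: it uses a loop in place of recursion, `fetch_with_retries` and `request_fn` are hypothetical names, and re-raising the last error once attempts run out is an assumption about what happens at the limit.

```python
import time


def fetch_with_retries(request_fn, max_retries=3, delay=0.0):
    """Call request_fn until it succeeds or the attempts run out.

    request_fn is a hypothetical zero-argument callable that returns the
    page text or raises an exception. max_retries=0 means retry forever,
    mirroring the documented meaning of a zero limit.
    """
    attempt = 0
    while True:
        try:
            return request_fn()
        except Exception:
            attempt += 1
            # a zero limit never triggers this branch
            if max_retries and attempt >= max_retries:
                raise
            time.sleep(delay)


# Simulated request that fails twice, then succeeds
calls = []


def flaky_request():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError('temporary failure')
    return '<html>ok</html>'


html_text = fetch_with_retries(flaky_request, max_retries=5)
print(len(calls), html_text)  # 3 <html>ok</html>
```

A loop avoids hitting Python's recursion limit when the retry count is unbounded, which is one reason to prefer it over recursion in a sketch like this.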

soup()

Returns a BeautifulSoup object loaded with HTML data from the URL.

Makes the web request, then returns a soup object loaded with the page's HTML. Uses html.parser with BeautifulSoup. Child classes may override this method to use other parsers (e.g. lxml).

Returns:	`bs4.BeautifulSoup` BeautifulSoup object loaded with parsed data from web

Examples

>>> # fetch webpage and parse into BeautifulSoup object
>>> parsed_webpage = WebPage('https://stackoverflow.com/').soup()
>>> # grab the logo from the header with BeautifulSoup
>>> header_logo_text = (
...     parsed_webpage.find('header')
...     .find('div', class_='-main')
...     .find('span', class_='-img')
... )
>>> # print the text contained in the span tag
>>> print(header_logo_text.get_text())
Stack Overflow

Module contents

dhelp

Students often see great potential in Python for historical analysis, but before they see any real payoff they often face more hurdles than they can overcome in a single semester. dhelp is a tool that lets students get straight to file operations, data manipulation, and even text analysis.