dhelp package
Submodules
dhelp.files module
dhelp.settings module
dhelp.text module
dhelp.web module
class dhelp.web.WebPage(url, options={})
Bases: collections.UserString
Downloads and parses HTML into BeautifulSoup objects.
Provides methods to download and parse a specified webpage. Merges the requests package with BeautifulSoup functions so that users can request and parse ("soup") a page in a single line.
Parameters:
- url (str)
- options (dict, optional)
Examples
>>> from dhelp import WebPage
>>> web_page = WebPage('https://stackoverflow.com')
>>> print(web_page)
https://stackoverflow.com
>>> # pass a dict to set options for delay, max_retries, silent, or parser
>>> options = {
...     'delay': 4,
...     'max_retries': 3,
...     'silent': True,
...     'parser': 'html.parser'
... }
>>> web_page = WebPage('https://stackoverflow.com', options=options)
>>> print(web_page)
https://stackoverflow.com
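Because WebPage is based on collections.UserString, the object behaves like the URL string it wraps, which is why printing it shows the URL. The following is a minimal sketch of that idea, not dhelp's actual implementation: the class name UrlString and the values in DEFAULT_OPTIONS are hypothetical and chosen only for illustration.

```python
from collections import UserString

# Hypothetical defaults for illustration; dhelp's real defaults may differ.
DEFAULT_OPTIONS = {
    'delay': 2,
    'max_retries': 0,
    'silent': False,
    'parser': 'html.parser',
}


class UrlString(UserString):
    """Hypothetical stand-in for WebPage: a string-like wrapper around a URL."""

    def __init__(self, url, options=None):
        # UserString keeps the text in self.data, so str() and print() show the URL
        super().__init__(url)
        # merge caller options over a copy of the defaults, so that
        # instances never share one mutable dict
        self.options = {**DEFAULT_OPTIONS, **(options or {})}


page = UrlString('https://stackoverflow.com', options={'delay': 4})
print(page)                    # https://stackoverflow.com
print(page.options['delay'])   # 4
print(page.options['parser'])  # html.parser
```

The sketch uses options=None rather than options={} to avoid sharing a mutable default argument between calls; whether dhelp itself guards against this is not stated here.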
fetch(retry_counter=0)
Returns the HTTP response from the URL as a string.
Can be called to return HTML data, although it is not generally meant to be called directly by the user. If you do call .fetch(), do not pass retry_counter, so that it starts at 0. This method is intended to be called by .soup() in order to feed its parser.
If the request is not successful, .fetch() calls itself recursively until it either succeeds or reaches the maximum number of attempts. If the .max_retries property is set to 0, .fetch() keeps retrying indefinitely.
Parameters: retry_counter (int)
Returns: str
HTML from the requested URL, in plain text format
Examples
>>> html_text = WebPage('https://stackoverflow.com/').fetch()
>>> html_text
'<!DOCTYPE html>\r\n<html>\r\n\r\n <head>\r\n\r\n <title>Stack Overflow...
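The retry behaviour described for .fetch() can be sketched roughly as follows. This is an illustration under stated assumptions, not dhelp's source: fetch_with_retries and get_page are hypothetical names, get_page stands in for the real HTTP request so the example runs without a network, and sleeping for delay seconds between attempts is an assumed reading of the delay option.

```python
import time


def fetch_with_retries(get_page, max_retries=3, delay=0, retry_counter=0):
    """Call get_page() until it succeeds or the retry budget is spent.

    get_page is a hypothetical zero-argument callable that returns HTML
    text or raises an exception; it stands in for the real HTTP request.
    """
    try:
        return get_page()
    except Exception:
        retry_counter += 1
        # a max_retries of 0 is treated as "never give up"
        if max_retries and retry_counter >= max_retries:
            raise
        # assumption: wait `delay` seconds before the next attempt
        time.sleep(delay)
        return fetch_with_retries(get_page, max_retries, delay, retry_counter)


attempts = []


def flaky_page():
    """Simulated request that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError('simulated network failure')
    return '<html></html>'


html_text = fetch_with_retries(flaky_page, max_retries=5)
print(html_text)      # <html></html>
print(len(attempts))  # 3
```

In this sketch, a max_retries of 0 combined with a request that never succeeds would eventually exceed Python's recursion limit, so a loop may be a safer shape for truly unbounded retries.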
soup()
Returns a BeautifulSoup object loaded with HTML data from the URL.
Invokes a web request, then returns a soup object loaded with the page HTML. Uses html.parser with BeautifulSoup. Child classes may override this to use other parsers (e.g. lxml).
Returns: bs4.BeautifulSoup
BeautifulSoup object loaded with parsed data from the web
Examples
>>> # fetch webpage and parse into BeautifulSoup object
>>> parsed_webpage = WebPage('https://stackoverflow.com/').soup()
>>> # grab the logo from the header with BeautifulSoup
>>> header_logo_text = (
...     parsed_webpage.find('header')
...     .find('div', class_='-main')
...     .find('span', class_='-img')
... )
>>> # print the text contained in the span tag
>>> print(header_logo_text.get_text())
Stack Overflow
Module contents
dhelp
David J. Thomas, thePortus.com, Copyright, 2018
Students often see great potential in Python for historical analysis. But before they see a real payoff, they often face too many hurdles to overcome in the space of a single semester. dhelp is a tool that lets students move quickly to performing file operations, data manipulations, and even text analysis.