dsci524_group29_webscraping

Attributes

__version__

Functions

save_data(data[, format, destination])

Saves the extracted data into a file.

parse_content(html_content, selector[, selector_type])

Parses HTML content to extract data based on the provided selector.

fetch_html(url[, timeout])

Fetches the HTML content of a given URL.

Package Contents

dsci524_group29_webscraping.__version__
dsci524_group29_webscraping.save_data(data, format='csv', destination='output.csv')

Saves the extracted data into a file.

Parameters:
  • data (list or dict) – The data to be saved.
      - For 'csv', it must be a list of dictionaries where each dictionary represents a row.
      - For 'json', it can be either a list or a dictionary.

  • format (str, optional) – The format in which to save the data. Default is 'csv'. Options are:
      - 'csv': Saves the data as a CSV file. Each key in the dictionaries becomes a column header.
      - 'json': Saves the data as a JSON file. The data is serialized with indentation for readability.

  • destination (str, optional) – The file path to save the data. Default is 'output.csv'. Can specify:
      - A file name (e.g., 'output.csv').
      - A full path (e.g., '/path/to/output.csv').

Returns:

The absolute path to the saved file.

Return type:

str

Raises:
  • ValueError – If the format is unsupported or if the data structure is incompatible with the format.

  • FileNotFoundError – If the directory specified in the destination path does not exist.

  • Exception – If an unexpected error occurs during the file-writing process.

Examples

# Save data as a CSV file
save_data([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}], format='csv', destination='data.csv')

# Save data as a JSON file
save_data({"name": "Alice", "age": 25}, format='json', destination='data.json')

Notes

  • The directory specified in the destination path must exist; otherwise, a FileNotFoundError is raised.

  • For 'csv', the first dictionary in the list determines the column headers.
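
The behavior documented above for save_data can be illustrated with a minimal standard-library sketch. This is not the package's source: the name save_data_sketch, the non-empty-list check, and the indentation width of 4 are assumptions made for illustration only.

```python
import csv
import json
import os


def save_data_sketch(data, format="csv", destination="output.csv"):
    """Illustrative stand-in for save_data; not the package's code."""
    # The destination directory must already exist.
    directory = os.path.dirname(os.path.abspath(destination))
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"Directory does not exist: {directory}")

    if format == "csv":
        # Assumption: an empty list is rejected because it has no headers.
        is_rows = isinstance(data, list) and data and all(
            isinstance(row, dict) for row in data
        )
        if not is_rows:
            raise ValueError("For 'csv', data must be a list of dictionaries.")
        with open(destination, "w", newline="", encoding="utf-8") as f:
            # The first dictionary's keys become the column headers.
            writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
            writer.writeheader()
            writer.writerows(data)
    elif format == "json":
        if not isinstance(data, (list, dict)):
            raise ValueError("For 'json', data must be a list or a dictionary.")
        with open(destination, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=4)  # indented for readability
    else:
        raise ValueError(f"Unsupported format: {format!r}")

    # Return the absolute path of the file that was written.
    return os.path.abspath(destination)
```

Because the headers come from the first dictionary only, csv.DictWriter raises a ValueError in this sketch if a later row contains a key that the first row lacks.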

dsci524_group29_webscraping.parse_content(html_content, selector, selector_type='css')

Parses HTML content to extract data based on the provided selector.

Parameters:
  • html_content (str) – The raw HTML content to be parsed.

  • selector (str) – The query to locate elements in the HTML content.
      - For CSS selectors: Use .class, #id, or tagname.
      - For XPath: Use expressions like //tag[@attribute='value'].

  • selector_type (str, optional) – The type of selector to use. Case-insensitive. Default is 'css'. Options:
      - 'css': Uses a CSS selector (e.g., .item selects elements with class "item").
      - 'xpath': Uses an XPath expression (e.g., //div[@class='item'] selects <div> elements with class "item").

Returns:

A list of dictionaries containing extracted data.
  • Example output: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}].

Return type:

list

Raises:

ValueError – If the selector_type is unsupported or an error occurs during parsing.

Example

# Sample HTML content
html_content = '<html><body><div class="item">alfa</div><div class="item">bravo</div><div class="item">charlie</div></body></html>'

# Using a CSS selector
parse_content(html_content, ".item")
# Returns: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}]

# Using an XPath selector
parse_content(html_content, "//div[@class='item']", selector_type='xpath')
# Returns: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}]
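
To make the selector handling concrete, here is a rough standard-library sketch of how the documented behavior could be reproduced. It is not the package's implementation, which may well rely on a dedicated HTML parser. The name parse_content_sketch is hypothetical, xml.etree.ElementTree is used as a stand-in and therefore requires well-formed markup, and only the three simple CSS forms listed above (.class, #id, tagname) are translated.

```python
import xml.etree.ElementTree as ET


def parse_content_sketch(html_content, selector, selector_type="css"):
    """Illustrative stand-in for parse_content; not the package's code."""
    kind = selector_type.lower()  # selector_type is documented as case-insensitive
    if kind == "css":
        # Translate only the simple CSS forms into ElementTree paths.
        if selector.startswith("."):
            path = f".//*[@class='{selector[1:]}']"
        elif selector.startswith("#"):
            path = f".//*[@id='{selector[1:]}']"
        else:
            path = f".//{selector}"
    elif kind == "xpath":
        # ElementTree only accepts relative paths, so "//div" becomes ".//div".
        path = "." + selector if selector.startswith("//") else selector
    else:
        raise ValueError(f"Unsupported selector_type: {selector_type!r}")

    try:
        root = ET.fromstring(html_content)  # assumes well-formed markup
        matches = root.findall(path)
    except (ET.ParseError, SyntaxError) as exc:
        # Surface parsing problems as ValueError, as documented above.
        raise ValueError(f"Error while parsing: {exc}") from exc

    return [{"value": (el.text or "").strip()} for el in matches]
```

In this sketch the class match is an exact comparison against the whole class attribute, so an element with class="item active" would not match .item. A real CSS engine treats the attribute as a space-separated list.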

dsci524_group29_webscraping.fetch_html(url, timeout=10)

Fetches the HTML content of a given URL.

Parameters:
  • url (str) – The URL of the webpage to fetch.

  • timeout (int, optional) – The maximum time to wait for a response, in seconds. Defaults to 10 seconds.

Returns:

The raw HTML content of the webpage if the request is successful.

Return type:

str

Raises:
  • ValueError – If the URL provided is invalid or improperly formatted.

  • requests.exceptions.Timeout – If the request times out before receiving a response.

  • requests.exceptions.RequestException – For other issues during the HTTP request, such as connectivity problems or a non-success HTTP status code.

Examples

Fetch the HTML content of a webpage:

>>> html_content = fetch_html("https://example.com")
>>> print(html_content[:100])  # Prints the first 100 characters of the HTML content

Notes

  • This function uses the requests library to perform an HTTP GET request.

  • Ensure the requests library is installed before using this function.
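
The reference does not say how fetch_html decides that a URL is "invalid or improperly formatted". One plausible check, sketched below with the standard library, is to require an http or https scheme and a non-empty host. The helper name validate_url_sketch and the exact rules are assumptions, not the package's behavior.

```python
from urllib.parse import urlparse


def validate_url_sketch(url):
    """Hypothetical pre-flight check; raises ValueError for malformed URLs."""
    if not isinstance(url, str):
        raise ValueError("URL must be a string.")
    parsed = urlparse(url)
    # Assumption: only http(s) URLs with a host component are accepted.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    return url
```

A check like this fails fast before any network traffic occurs, which keeps malformed input (ValueError) distinct from network failures (the requests exceptions listed above).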