dsci524_group29_webscraping
Submodules
Attributes
Functions
- save_data: Saves the extracted data into a file.
- parse_content: Parses HTML content to extract data based on the provided selector.
- fetch_html: Fetches the HTML content of a given URL.
Package Contents
- dsci524_group29_webscraping.__version__
- dsci524_group29_webscraping.save_data(data, format='csv', destination='output.csv')[source]
Saves the extracted data into a file.
- Parameters:
data (list or dict) – The data to be saved.
  - For 'csv', it must be a list of dictionaries where each dictionary represents a row.
  - For 'json', it can be either a list or a dictionary.
format (str, optional) – The format in which to save the data. Default is 'csv'. Options are:
  - 'csv': Saves the data as a CSV file. Each key in the dictionaries becomes a column header.
  - 'json': Saves the data as a JSON file. The data is serialized with indentation for readability.
destination (str, optional) – The file path to save the data. Default is 'output.csv'. Can specify:
  - A file name (e.g., 'output.csv').
  - A full path (e.g., '/path/to/output.csv').
- Returns:
The absolute path to the saved file.
- Return type:
str
- Raises:
ValueError – If the format is unsupported or if the data structure is incompatible with the format.
FileNotFoundError – If the directory specified in the destination path does not exist.
Exception – If an unexpected error occurs during the file-writing process.
Examples
# Save data as a CSV file
save_data([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}], format='csv', destination='data.csv')
# Save data as a JSON file
save_data({"name": "Alice", "age": 25}, format='json', destination='data.json')
Notes
The directory specified in the destination path must exist; otherwise, a FileNotFoundError is raised.
For ‘csv’, the first dictionary in the list determines the column headers.
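The behavior described above (directory check, header row taken from the first dictionary, indented JSON, absolute path returned) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the package's actual source: the name save_rows_sketch, the error messages, the indent width, and the rejection of an empty list for CSV are all hypothetical.

```python
import csv
import json
import os


def save_rows_sketch(data, format="csv", destination="output.csv"):
    """Hypothetical stand-in that mirrors the documented save_data contract."""
    # The target directory must already exist.
    directory = os.path.dirname(os.path.abspath(destination))
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"Directory does not exist: {directory}")

    if format == "csv":
        # Assumption: an empty list is rejected, since no header can be derived.
        is_rows = isinstance(data, list) and all(isinstance(r, dict) for r in data)
        if not is_rows or not data:
            raise ValueError("For 'csv', data must be a non-empty list of dictionaries.")
        with open(destination, "w", newline="", encoding="utf-8") as handle:
            # The first dictionary determines the column headers.
            writer = csv.DictWriter(handle, fieldnames=list(data[0].keys()))
            writer.writeheader()
            writer.writerows(data)
    elif format == "json":
        if not isinstance(data, (list, dict)):
            raise ValueError("For 'json', data must be a list or a dictionary.")
        with open(destination, "w", encoding="utf-8") as handle:
            # Assumption: four-space indentation for readability.
            json.dump(data, handle, indent=4)
    else:
        raise ValueError(f"Unsupported format: {format}")

    return os.path.abspath(destination)
```

Because csv.DictWriter raises ValueError when a later row contains a key missing from the header, rows that disagree with the first dictionary also surface as a ValueError in this sketch.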
- dsci524_group29_webscraping.parse_content(html_content, selector, selector_type='css')[source]
Parses HTML content to extract data based on the provided selector.
- Parameters:
html_content (str) – The raw HTML content to be parsed.
selector (str) – The query to locate elements in the HTML content.
  - For CSS selectors: Use .class, #id, or tagname.
  - For XPath: Use expressions like //tag[@attribute='value'].
selector_type (str, optional) – The type of selector to use. Case-insensitive. Default is 'css'. Options:
  - 'css': Uses a CSS selector (e.g., .item selects elements with class "item").
  - 'xpath': Uses an XPath expression (e.g., //div[@class='item'] selects <div> elements with class "item").
- Returns:
A list of dictionaries containing extracted data.
Example output: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}].
- Return type:
list
- Raises:
ValueError – If the selector_type is unsupported or an error occurs during parsing.
Example
# Sample HTML content
html_content = '<html><body><div class="item">alfa</div><div class="item">bravo</div><div class="item">charlie</div></body></html>'
# Using a CSS selector
parse_content(html_content, ".item")
# Returns: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}]
# Using an XPath selector
parse_content(html_content, "//div[@class='item']", selector_type='xpath')
# Returns: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}]
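To make the documented return shape concrete, the XPath case can be approximated with the standard library's xml.etree.ElementTree. This is only a sketch under assumptions: the helper name parse_xpath_sketch is hypothetical, ElementTree supports a limited XPath subset and needs well-formed markup, and it requires a relative path (.//div rather than //div). It says nothing about which parser the package itself uses.

```python
import xml.etree.ElementTree as ET


def parse_xpath_sketch(html_content, xpath):
    """Hypothetical helper: return [{'value': text}, ...] for matching elements."""
    try:
        # Only works for well-formed markup; real-world HTML often is not.
        root = ET.fromstring(html_content)
        elements = root.findall(xpath)
    except SyntaxError as exc:
        # ET.ParseError subclasses SyntaxError, so both malformed markup and
        # invalid path expressions are reported as ValueError, as documented.
        raise ValueError(f"Error parsing content: {exc}") from exc
    return [{"value": (element.text or "").strip()} for element in elements]


html_content = (
    '<html><body><div class="item">alfa</div>'
    '<div class="item">bravo</div>'
    '<div class="item">charlie</div></body></html>'
)
print(parse_xpath_sketch(html_content, ".//div[@class='item']"))
# Prints: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}]
```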
- dsci524_group29_webscraping.fetch_html(url, timeout=10)[source]
Fetches the HTML content of a given URL.
- Parameters:
url (str) – The URL of the webpage to fetch.
timeout (int, optional) – The maximum time to wait for a response, in seconds. Defaults to 10 seconds.
- Returns:
The raw HTML content of the webpage if the request is successful.
- Return type:
str
- Raises:
ValueError – If the URL provided is invalid or improperly formatted.
requests.exceptions.Timeout – If the request times out before receiving a response.
requests.exceptions.RequestException – For other issues during the HTTP request, such as connectivity problems or a non-success HTTP status code.
Examples
Fetch the HTML content of a webpage:
>>> html_content = fetch_html("https://example.com")
>>> print(html_content[:100])  # Prints the first 100 characters of the HTML content
Notes
This function uses the requests library to perform an HTTP GET request.
Ensure the requests library is installed before using this function.
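The ValueError listed under Raises implies some validation of the URL before the request is sent. One way such a check could look, using only urllib.parse, is sketched here. The helper name validate_url_sketch and the rule that only http and https schemes with a host are accepted are assumptions for illustration; the package's real check may differ.

```python
from urllib.parse import urlparse


def validate_url_sketch(url):
    """Hypothetical pre-flight check: return the URL or raise ValueError."""
    if not isinstance(url, str):
        raise ValueError("URL must be a string.")
    parsed = urlparse(url)
    # Assumption: a usable URL has an http(s) scheme and a network location.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    return url
```

A caller would run this check first and only then issue the GET request with the chosen timeout, so that malformed input fails fast without any network traffic.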