dsci524_group29_webscraping.parse_content
Functions
parse_content(html_content, selector, selector_type='css') | Parses HTML content to extract data based on the provided selector.
Module Contents
- dsci524_group29_webscraping.parse_content.parse_content(html_content, selector, selector_type='css')
Parses HTML content to extract data based on the provided selector.
- Parameters:
html_content (str) – The raw HTML content to be parsed.
selector (str) – The query to locate elements in the HTML content.
  - For CSS selectors: use .class, #id, or tagname.
  - For XPath: use expressions like //tag[@attribute='value'].
selector_type (str, optional) – The type of selector to use. Case-insensitive. Default is 'css'. Options:
  - 'css': Uses a CSS selector (e.g., .item selects elements with class "item").
  - 'xpath': Uses an XPath expression (e.g., //div[@class='item'] selects <div> elements with class "item").
- Returns:
- A list of dictionaries containing extracted data.
Example output: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}].
- Return type:
list
- Raises:
ValueError – If the selector_type is unsupported or an error occurs during parsing.
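The selector_type rules described above (case-insensitive, 'css' by default, ValueError for anything unsupported) can be sketched as a small validation step. This is only an illustration under assumptions: normalize_selector_type and SUPPORTED_SELECTOR_TYPES are hypothetical names that do not come from the package, and the actual implementation may differ.

```python
# Hypothetical helper, not part of dsci524_group29_webscraping: it only
# illustrates the documented selector_type rules.
SUPPORTED_SELECTOR_TYPES = ("css", "xpath")


def normalize_selector_type(selector_type="css"):
    """Return the lower-cased selector type, or raise ValueError if unsupported."""
    if not isinstance(selector_type, str):
        raise ValueError(
            f"selector_type must be a string, got {type(selector_type).__name__}"
        )
    normalized = selector_type.lower()
    if normalized not in SUPPORTED_SELECTOR_TYPES:
        raise ValueError(
            f"Unsupported selector_type {selector_type!r}; "
            f"expected one of {SUPPORTED_SELECTOR_TYPES}"
        )
    return normalized


print(normalize_selector_type("XPath"))  # prints: xpath
print(normalize_selector_type())         # prints: css

try:
    normalize_selector_type("regex")
except ValueError as err:
    print(f"Rejected: {err}")
```

Callers of parse_content can use the same try/except ValueError pattern, since the documentation lists ValueError as the only raised exception.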
Example
from dsci524_group29_webscraping.parse_content import parse_content

# Sample HTML content
html_content = '<html><body><div class="item">alfa</div><div class="item">bravo</div><div class="item">charlie</div></body></html>'

# Using a CSS selector
parse_content(html_content, ".item")
# Returns: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}]

# Using an XPath selector
parse_content(html_content, "//div[@class='item']", selector_type='xpath')
# Returns: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}]
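To see what the XPath expression in the example selects without installing the package, the following sketch reproduces the documented output shape with Python's standard library. It is a stand-in under stated assumptions, not the package's implementation: xml.etree.ElementTree supports only a subset of XPath, requires a relative path (.//div rather than //div), and can parse this sample only because it happens to be well-formed XML. The helper name extract_values is hypothetical.

```python
import xml.etree.ElementTree as ET

html_content = (
    '<html><body>'
    '<div class="item">alfa</div>'
    '<div class="item">bravo</div>'
    '<div class="item">charlie</div>'
    '</body></html>'
)


def extract_values(markup, path):
    """Hypothetical stand-in: return [{'value': text}, ...] for matching elements."""
    root = ET.fromstring(markup)
    return [
        {"value": (element.text or "").strip()}
        for element in root.findall(path)
    ]


results = extract_values(html_content, ".//div[@class='item']")
print(results)
# [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}]

# The list-of-dictionaries shape flattens easily into plain values.
values = [row["value"] for row in results]
print(values)
# ['alfa', 'bravo', 'charlie']
```

Real-world HTML is rarely well-formed XML, so this stand-in is useful only for checking a selector against small, clean snippets.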