dsci524_group29_webscraping
===========================

.. py:module:: dsci524_group29_webscraping


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/dsci524_group29_webscraping/fetch_html/index
   /autoapi/dsci524_group29_webscraping/parse_content/index
   /autoapi/dsci524_group29_webscraping/save_data/index


Attributes
----------

.. autoapisummary::

   dsci524_group29_webscraping.__version__


Functions
---------

.. autoapisummary::

   dsci524_group29_webscraping.save_data
   dsci524_group29_webscraping.parse_content
   dsci524_group29_webscraping.fetch_html


Package Contents
----------------

.. py:data:: __version__

.. py:function:: save_data(data, format='csv', destination='output.csv')

   Saves the extracted data into a file.

   :param data: The data to be saved.

      - For 'csv', it must be a list of dictionaries where each dictionary represents a row.
      - For 'json', it can be either a list or a dictionary.
   :type data: list or dict
   :param format: The format in which to save the data. Options are:

      - 'csv': Saves the data as a CSV file. Each key in the dictionaries becomes a column header.
      - 'json': Saves the data as a JSON file. The data is serialized with indentation for readability.

      Default is 'csv'.
   :type format: str, optional
   :param destination: The file path to save the data. Can specify:

      - A file name (e.g., 'output.csv').
      - A full path (e.g., '/path/to/output.csv').

      Default is 'output.csv'.
   :type destination: str, optional

   :returns: The absolute path to the saved file.
   :rtype: str

   :raises ValueError: If the format is unsupported or if the data structure is incompatible with the format.
   :raises FileNotFoundError: If the directory specified in the destination path does not exist.
   :raises Exception: If an unexpected error occurs during the file-writing process.

   .. rubric:: Examples

   .. code-block:: python

      # Save data as a CSV file
      save_data([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}], format='csv', destination='data.csv')

      # Save data as a JSON file
      save_data({"name": "Alice", "age": 25}, format='json', destination='data.json')

   .. rubric:: Notes

   - The directory specified in the destination path must exist; otherwise, a FileNotFoundError is raised.
   - For 'csv', the first dictionary in the list determines the column headers.


.. py:function:: parse_content(html_content, selector, selector_type='css')

   Parses HTML content to extract data based on the provided selector.

   :param html_content: The raw HTML content to be parsed.
   :type html_content: str
   :param selector: The query to locate elements in the HTML content.

      - For CSS selectors: Use ``.class``, ``#id``, or ``tagname``.
      - For XPath: Use expressions like ``//tag[@attribute='value']``.
   :type selector: str
   :param selector_type: The type of selector to use. Options:

      - 'css': Uses a CSS selector (e.g., ``.item`` selects elements with class "item").
      - 'xpath': Uses an XPath expression (e.g., ``//div[@class='item']`` selects
        elements with class "item").

      Case-insensitive. Default is 'css'.
   :type selector_type: str, optional

   :returns: A list of dictionaries containing extracted data.

      - Example output: ``[{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}]``.
   :rtype: list

   :raises ValueError: If the selector_type is unsupported or an error occurs during parsing.

   .. rubric:: Example

   .. code-block:: python

      # Sample HTML content
      html_content = '<div class="item">alfa</div><div class="item">bravo</div><div class="item">charlie</div>'

      # Using a CSS selector
      parse_content(html_content, ".item")
      # Returns: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}]

      # Using an XPath selector
      parse_content(html_content, "//div[@class='item']", selector_type='xpath')
      # Returns: [{'value': 'alfa'}, {'value': 'bravo'}, {'value': 'charlie'}]


.. py:function:: fetch_html(url, timeout=10)

   Fetches the HTML content of a given URL.

   :param url: The URL of the webpage to fetch.
   :type url: str
   :param timeout: The maximum time to wait for a response, in seconds. Defaults to 10 seconds.
   :type timeout: int, optional

   :returns: The raw HTML content of the webpage if the request is successful.
   :rtype: str

   :raises ValueError: If the URL provided is invalid or improperly formatted.
   :raises requests.exceptions.Timeout: If the request times out before receiving a response.
   :raises requests.exceptions.RequestException: For other issues during the HTTP request, such as connectivity problems
       or a non-success HTTP status code.

   .. rubric:: Examples

   Fetch the HTML content of a webpage:

   >>> html_content = fetch_html("https://example.com")
   >>> print(html_content[:100])  # Prints the first 100 characters of the HTML content

   .. rubric:: Notes

   - This function uses the `requests` library to perform an HTTP GET request.
   - Ensure the `requests` library is installed before using this function.
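
The ``save_data`` contract documented earlier on this page (CSV headers taken from the first dictionary, indented JSON, an absolute path as the return value, and the two documented error cases) can be illustrated with the standard library alone. The sketch below is not the package's implementation: ``save_data_sketch`` is a hypothetical stand-in, and the four-space JSON indent and the rejection of an empty list for CSV are assumptions made only for illustration.

```python
import csv
import json
import os


def save_data_sketch(data, format="csv", destination="output.csv"):
    """Hypothetical stand-in that mirrors the documented save_data contract."""
    if format not in ("csv", "json"):
        raise ValueError(f"Unsupported format: {format!r}")

    path = os.path.abspath(destination)
    directory = os.path.dirname(path)
    # Documented behaviour: the destination directory must already exist.
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"Directory does not exist: {directory}")

    if format == "csv":
        # CSV requires a list of dictionaries, one per row.
        # Assumption: an empty list is treated as incompatible data.
        is_rows = isinstance(data, list) and all(isinstance(r, dict) for r in data)
        if not is_rows or not data:
            raise ValueError("CSV data must be a non-empty list of dictionaries")
        with open(path, "w", newline="", encoding="utf-8") as f:
            # The first dictionary determines the column headers.
            writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
            writer.writeheader()
            writer.writerows(data)
    else:
        # JSON accepts either a list or a dictionary.
        if not isinstance(data, (list, dict)):
            raise ValueError("JSON data must be a list or a dictionary")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=4)  # indent width is an assumption

    return path
```

Because ``csv.DictWriter`` raises ``ValueError`` when a later row contains a key missing from the header, the sketch also surfaces inconsistent rows as the documented ``ValueError``; whether the real ``save_data`` behaves the same way for such rows is not stated in this reference.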