{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial\n", "\n", "This package provides tools for web scraping, including:\n", "\n", "- Fetching HTML content from a URL.\n", "- Parsing specific elements from the HTML content.\n", "- Saving the extracted data to a file.\n", "\n", "In this tutorial, you will learn how to use each function in the package with real-life examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_First, you will need to import these three functions in order to use them in your own pipeline. The functions can easily be imported with the example code in this cell._" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dsci524_group29_webscraping.fetch_html import fetch_html\n", "from dsci524_group29_webscraping.parse_content import parse_content\n", "from dsci524_group29_webscraping.save_data import save_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetch HTML content from a website" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Next, call the `fetch_html` function, with the URL of the website you want to scrape. In the example below, we use the [IANA Example Domain](https://example.com). The output from this website is simple and can be printed as illustrated below._\n", "\n", "_You can try it with another website of your choosing. However, you might want to first check the length of the response (`len(html_content)`) to see if you can print all of it in your notebook or to the console._" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "
\n", "This domain is for use in illustrative examples in documents. You may use this\n", " domain in literature without prior coordination or asking for permission.
\n", " \n", "`) from the HTML content using a CSS selector. The code below shows how you can do that:_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'value': 'This domain is for use in illustrative examples in documents. You may use this\\n domain in literature without prior coordination or asking for permission.'}, {'value': None}]\n" ] } ], "source": [ "parsed_data = parse_content(html_content, selector=\"p\", selector_type=\"css\")\n", "print(parsed_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_From the same sample HTML, we might want to parse HTML Heading 1 tags (`
` tags. The `save_data` function will allow you to save all the elements that we retrieved into a file._\n", "\n", "_The example below saves the `<p>` tags retrieved from the example above (in the `parsed_data` variable) to a CSV file `output_paragraphs.csv`:_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Paragraphs saved to 'output_paragraphs.csv'.\n" ] } ], "source": [ "save_data(parsed_data, format=\"csv\", destination=\"output_paragraphs.csv\")\n", "print(\"Paragraphs saved to 'output_paragraphs.csv'.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_And the one below saves the `