{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial\n", "\n", "This package provides tools for web scraping, including:\n", "\n", "- Fetching HTML content from a URL.\n", "- Parsing specific elements from the HTML content.\n", "- Saving the extracted data to a file.\n", "\n", "In this tutorial, you will learn how to use each function in the package with real-life examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_First, you will need to import these three functions in order to use them in your own pipeline. The functions can easily be imported with the example code in this cell._" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dsci524_group29_webscraping.fetch_html import fetch_html\n", "from dsci524_group29_webscraping.parse_content import parse_content\n", "from dsci524_group29_webscraping.save_data import save_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetch HTML content from a website" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Next, call the `fetch_html` function, with the URL of the website you want to scrape. In the example below, we use the [IANA Example Domain](https://example.com). The output from this website is simple and can be printed as illustrated below._\n", "\n", "_You can try it with another website of your choosing. However, you might want to first check the length of the response (`len(html_content)`) to see if you can print all of it in your notebook or to the console._" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", " Example Domain\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", "

\n", "

Example Domain

\n", "

This domain is for use in illustrative examples in documents. You may use this\n", " domain in literature without prior coordination or asking for permission.

\n", "

More information...

\n", "

\n", "\n", "\n", "\n" ] } ], "source": [ "url = \"https://example.com\"\n", "html_content = fetch_html(url)\n", "print(html_content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parse HTML content using different selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Now you can parse the HTML text to extract specific elements from it. For this, you will need to have some basic understanding of HTML, which you can review [here](https://www.w3schools.com/html/html_basic.asp)._\n", "\n", "_For example, from the [example html](https://example.com) retrieved in the previous step, we might want to parse paragraph tags (`

`) from the HTML content using CSS selector. The code below shows show you can do that:_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'value': 'This domain is for use in illustrative examples in documents. You may use this\\n domain in literature without prior coordination or asking for permission.'}, {'value': None}]\n" ] } ], "source": [ "parsed_data = parse_content(html_content, selector=\"p\", selector_type=\"css\")\n", "print(parsed_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_From the same sample HTML, we might want to parse HTML Heading 1 tags (`

`) using XPath selector as shown in the code below:_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'value': 'Example Domain'}]\n" ] } ], "source": [ "parsed_headings = parse_content(html_content, selector=\"//h1\", selector_type=\"xpath\")\n", "print(parsed_headings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save parsed data to CSV file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_And finally, you can save just the bits you extracted from the HTML in a file! In the example above, we retrieve a simple list of 1 element in each case. However, a web page will typically have several elements fitting the specification. For instance, a page might have several `

` or `
` tags. The `save_data` function will allow you to save all the elements that we retrieved into a file._\n", "\n", "_The example below saves the `
` tags retrieved from the example above (in the `parsed_data` variable) to a CSV file `output_paragraphs.csv`:_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Paragraphs saved to 'output_paragraphs.csv'.\n" ] } ], "source": [ "save_data(parsed_data, format=\"csv\", destination=\"output_paragraphs.csv\")\n", "print(\"Paragraphs saved to 'output_paragraphs.csv'.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_And the one below saves the `

` tags retrieved from the example above (in the `parsed_headings` variable) to a CSV file `output_headings.csv`:_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Headings saved to 'output_headings.csv'.\n" ] } ], "source": [ "save_data(parsed_headings, format=\"csv\", destination=\"output_headings.csv\")\n", "print(\"Headings saved to 'output_headings.csv'.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Now you can use these examples to try out many other websites! Here is an easy suggestion: how many `

` tags are on the [UBC MDS homepage](https://masterdatascience.ubc.ca/)?_" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 4 }