Tutorial

This package provides tools for web scraping, including:

  • Fetching HTML content from a URL.

  • Parsing specific elements from the HTML content.

  • Saving the extracted data to a file.

In this tutorial, you will learn how to use each function in the package with real-life examples.

Imports

First, import the three functions so that you can use them in your own pipeline. The example code in this cell imports all of them:

from dsci524_group29_webscraping.fetch_html import fetch_html
from dsci524_group29_webscraping.parse_content import parse_content
from dsci524_group29_webscraping.save_data import save_data

Fetch HTML content from a website

Next, call the fetch_html function with the URL of the website you want to scrape. The example below uses the IANA Example Domain. The HTML returned by this website is short enough to print in full, as illustrated below.

You can try it with another website of your choosing. However, you might want to first check the length of the response (len(html_content)) to see if you can print all of it in your notebook or to the console.

url = "https://example.com"
html_content = fetch_html(url)
print(html_content)
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
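
The length check suggested above can be sketched as a small helper. Note that preview and its limit parameter are hypothetical names used only for illustration; they are not part of the package. The sample string simply stands in for the value returned by fetch_html.

```python
def preview(text, limit=500):
    """Return text unchanged if it is short, else its first `limit` characters plus a note."""
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    return f"{text[:limit]}\n... [{omitted} more characters omitted]"


# Stand-in for html_content = fetch_html(url); any string works here.
sample = "<html>" + "x" * 1000 + "</html>"

print(len(sample))                # check the size before printing everything
print(preview(sample, limit=80))  # print only the first 80 characters
```

With a real page, you would pass html_content to preview instead of sample.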

Parse HTML content using different selectors

Now you can parse the HTML text to extract specific elements from it. For this, you will need to have some basic understanding of HTML, which you can review here.

For example, from the HTML retrieved in the previous step, we might want to extract the paragraph tags (<p>) using a CSS selector. The code below shows how you can do that:

parsed_data = parse_content(html_content, selector="p", selector_type="css")
print(parsed_data)
[{'value': 'This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.'}, {'value': None}]
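
The result above is a list of dictionaries, each with a 'value' key, and the second entry's value is None. If you only want entries that carry text, you can filter the list before using it. The sketch below works on a copy of the output shown above; non_empty is a hypothetical variable name, and collapsing the whitespace is an optional extra step, not something the package does for you.

```python
# Output of parse_content, copied from the example above.
parsed_data = [
    {"value": "This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission."},
    {"value": None},
]

# Keep only entries that have text, and collapse runs of whitespace
# (including the embedded newline) into single spaces.
non_empty = [
    " ".join(item["value"].split())
    for item in parsed_data
    if item["value"] is not None
]

print(non_empty)
```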

From the same sample HTML, we might want to extract the level-1 heading tags (<h1>) using an XPath selector, as shown in the code below:

parsed_headings = parse_content(html_content, selector="//h1", selector_type="xpath")
print(parsed_headings)
[{'value': 'Example Domain'}]
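
To build some intuition for what such selectors pick out, here is a minimal sketch that uses only Python's standard library on a simplified, well-formed copy of the page body. It does not use the package and makes no claim about how parse_content works internally; xml.etree.ElementTree supports only a limited subset of XPath. It also shows one plausible reason (an assumption, not confirmed by the package) why a paragraph whose text sits inside a nested <a> tag can come back as None.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the <body> from the example page.
snippet = """
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
""".strip()

root = ET.fromstring(snippet)

# ".//h1" is the ElementTree spelling of the XPath "//h1".
headings = [h1.text for h1 in root.findall(".//h1")]

# .text holds only the text that appears before the first child element,
# so a <p> whose text sits inside a nested <a> yields None.
direct_text = [p.text for p in root.findall(".//p")]

# itertext() also walks nested elements, so the link text is recovered.
full_text = ["".join(p.itertext()) for p in root.findall(".//p")]

print(headings)
print(direct_text)
print(full_text)
```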

Save parsed data to CSV file

Finally, you can save just the parts you extracted from the HTML to a file. In the examples above, each call returned only a short list: two entries for <p> and one for <h1>. However, a web page will typically have many elements that match a selector. For instance, a page might have dozens of <p> tags. The save_data function lets you save all of the retrieved elements to a file.

The example below saves the <p> tags retrieved from the example above (in the parsed_data variable) to a CSV file output_paragraphs.csv:

save_data(parsed_data, format="csv", destination="output_paragraphs.csv")
print("Paragraphs saved to 'output_paragraphs.csv'.")
Paragraphs saved to 'output_paragraphs.csv'.

And the one below saves the <h1> tags retrieved from the example above (in the parsed_headings variable) to a CSV file output_headings.csv:

save_data(parsed_headings, format="csv", destination="output_headings.csv")
print("Headings saved to 'output_headings.csv'.")
Headings saved to 'output_headings.csv'.
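
To confirm what was written, you can read a CSV file back with Python's built-in csv module. The sketch below is self-contained, so it first writes a stand-in file with csv.DictWriter. The layout it uses (a header row followed by one row per dictionary) is an assumption and may differ from what save_data actually produces; the temporary path is likewise only for illustration.

```python
import csv
import os
import tempfile

# Stand-in for parsed_headings from the example above.
rows = [{"value": "Example Domain"}]

# Assumed layout: a header row followed by one row per dictionary.
path = os.path.join(tempfile.mkdtemp(), "output_headings.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["value"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to check its contents.
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded)
```

To inspect a file written by save_data, only the read-back step is needed, with path pointing at your own output file.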

Now you can use these examples to try out many other websites! Here is an easy suggestion: how many <h2> tags are on the UBC MDS homepage?