Beginner-Friendly Python Toolkit for Web Scraping
A Python package that simplifies web scraping for data scientists who are new to it.
Create Virtual Environment
We recommend that you first create a new conda environment with the latest version of Python, or at least version 3.9. You can do so with the command below:
conda create --name webscraping python=3.12 -y
After the environment is created, activate it:
conda activate webscraping
Installation
You can then install the package in your environment using the command:
pip install dsci524_group29_webscraping
After installation, have fun scraping the web!
Functions
The package has three functions that you can use in a typical web scraping workflow as follows:
fetch_html(url): Retrieves the raw HTML content from the specified URL, handling HTTP requests and potential errors. You can capture the result in a string variable to use later in your pipeline.
parse_content(html, selector, selector_type): Parses the provided HTML content using CSS selectors or XPath to extract the specified data. The html parameter is a string value retrieved using fetch_html(url). This function returns a list of values that you can use in the final part of your pipeline.
save_data(data, format, destination): Saves the extracted data in the desired format (e.g., TXT, CSV, JSON) at the specified destination path.
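The three functions are meant to be chained: fetch, then parse, then save. The sketch below shows one way to wire them together. `run_pipeline` is a hypothetical helper that is not part of the package, and the `{"value": ...}` row shape is an assumption borrowed from the save example in this README. The functions are passed in as arguments, so the sketch can be tried offline with stand-ins.

```python
def run_pipeline(url, selector, destination, *, fetch, parse, save,
                 selector_type="css", format="csv"):
    """Fetch a page, extract values, and save them (hypothetical helper)."""
    html = fetch(url)                               # step 1: raw HTML as a string
    values = parse(html, selector, selector_type)   # step 2: list of extracted values
    rows = [{"value": value} for value in values]   # assumed row shape, not confirmed
    return save(rows, format=format, destination=destination)  # step 3


# Offline stand-ins (hypothetical) that mimic the documented call signatures.
def fake_fetch(url):
    return "<h1>Example Domain</h1>"


def fake_parse(html, selector, selector_type):
    return ["Example Domain"]


saved = {}


def fake_save(data, format, destination):
    saved[destination] = (format, data)
    return destination


result = run_pipeline("https://example.com", "h1", "output.csv",
                      fetch=fake_fetch, parse=fake_parse, save=fake_save)
print(result)
```

With the package installed, passing `fetch=fetch_html`, `parse=parse_content`, and `save=save_data` should run the same three steps against a live page.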
Usage
The examples below demonstrate how to use the main functions in this package. The examples fetch content from the IANA Example Domain:
1. Fetch HTML Content
from dsci524_group29_webscraping import fetch_html
# Fetch the raw HTML content from a webpage
url = "https://example.com"
html_content = fetch_html(url)
print(html_content) # Outputs the HTML content of the page
2. Parse Content
from dsci524_group29_webscraping import parse_content
# Parse the HTML content to extract specific elements
selector = "h1" # Example: extract all <h1> header elements
selector_type = "css" # Use CSS selectors
extracted_data = parse_content(html_content, selector, selector_type)
print(extracted_data) # Outputs a list of the extracted data
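To build intuition for what a tag selector such as "h1" picks out, here is a standard-library sketch that collects the text of every matching element. It is not the package's implementation: `TagTextCollector` and `extract_tag_text` are hypothetical names, it handles plain tag names only (not full CSS or XPath), and it assumes the extracted values are the elements' text.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one tag name."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag.lower()
        self.open_count = 0    # greater than 0 while inside a matching element
        self.text_parts = []   # text fragments of the element being read
        self.found = []        # finished text values, in document order

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.open_count += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.open_count > 0:
            self.open_count -= 1
            if self.open_count == 0:
                self.found.append("".join(self.text_parts).strip())
                self.text_parts = []

    def handle_data(self, data):
        if self.open_count > 0:
            self.text_parts.append(data)


def extract_tag_text(html, tag):
    """Return the text of each <tag> element found in the HTML string."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.found


sample_html = "<html><body><h1>Example Domain</h1><p>Some text.</p></body></html>"
headings = extract_tag_text(sample_html, "h1")
print(headings)
```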
3. Save Data
from dsci524_group29_webscraping import save_data
# Save the extracted data to a CSV file
data = [{'value': 'Example Domain'}] # Example data
file_path = save_data(data, format="csv", destination="output.csv")
print(f"Data saved to: {file_path}")
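After saving, a quick sanity check is to read the file back. The sketch below uses only the standard library and assumes the CSV has a header row taken from the dictionary keys, which this README does not confirm. `read_csv_rows` is a hypothetical helper, and the demo file is written locally as a stand-in for real save_data output.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_rows(path):
    """Return the rows of a CSV file as a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Stand-in for a file produced by save_data, written here so the sketch runs offline.
demo_path = Path(tempfile.mkdtemp()) / "output.csv"
with open(demo_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["value"])
    writer.writeheader()
    writer.writerow({"value": "Example Domain"})

rows = read_csv_rows(demo_path)
print(rows)
```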
This package reduces fetching, parsing, and saving web data to the three steps above, which makes it well suited to beginners.
Python Ecosystem
While libraries like BeautifulSoup
and Scrapy offer comprehensive web scraping capabilities,
dsci524_group29_webscraping aims to provide a more streamlined and beginner-friendly approach.
By focusing on three core functions, it abstracts
the complexities involved in web scraping, making
it accessible for quick tasks and educational purposes.
Similar Packages:
webscraping: Provides web scraping functions, but its rich feature set goes beyond beginner level.
webscraping_tools: Offers similar functionality and much more, which in our opinion places it at the intermediate level.
dsci524_group29_webscraping differentiates itself by offering a small set of functions that cover simple, beginner-level needs.
Contributors
Lixuan Lin
Hui Tang
Sienko Ikhabi
Contributing
Interested in contributing? Check out the contributing guidelines.
Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by the specified terms.
License
Package dsci524_group29_webscraping was created by Lixuan Lin, Hui Tang and Sienko Ikhabi for the Master of Data Science, University of British Columbia. It is licensed under the terms of the MIT license.
Credits
This project was created with cookiecutter from the py-pkgs-cookiecutter template.