Making the scraper RAP friendly: some example functions

Published

September 18, 2024

Previously, we saw how to map and plan out a more RAP friendly scraper by developing a functional programming style. Let’s apply some of this to a fraction of the scraper to see how it would work. We will still do this interactively (i.e. in a Jupyter notebook), but as a next step we will want to move these functions into separate .py files.

Import what we will need

As before, we should first import the libraries and packages we will need.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import logging
import sys
import time

Create utility functions that isolate the work

def get_site_data(session, url_to_scrape, header, logger):
    """
    Function to isolate just the scraping portion and log the fact that we called the site
    """
    logger.info("Getting data from the site")
    with session.get(url_to_scrape, headers=header) as res:
        response = BeautifulSoup(res.text, "html.parser")
    return response
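
A request can fail or come back as an error page, and `get_site_data` as written would parse whatever it receives. As a minimal sketch of a more defensive variant, the request step could check the HTTP status first and log any failure. The function name `get_site_text`, the `timeout` default, and the decision to log and re-raise are assumptions made for illustration, not part of the scraper above.

```python
def get_site_text(session, url_to_scrape, header, logger, timeout=10):
    """
    Hypothetical variant of get_site_data: fetch the page, fail loudly on
    HTTP errors, and return the raw HTML text so it can be parsed elsewhere.
    """
    logger.info("Getting data from the site")
    try:
        res = session.get(url_to_scrape, headers=header, timeout=timeout)
        # raise_for_status() raises an exception for 4xx/5xx responses
        res.raise_for_status()
    except Exception:
        # Broad on purpose for this sketch; with requests you would likely
        # narrow this to requests.RequestException.
        logger.exception("Request to %s failed", url_to_scrape)
        raise
    return res.text
```

The returned text can then be passed to `BeautifulSoup(text, "html.parser")` in the same way `get_site_data` does.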

def get_and_parse_product_page(product_page_url, session, logger, header):
    """
    Function to focus on the parsing of information from the product page

    Note - we would probably want to modularize this further and add other aspects
    like error handling
    """
    response = get_site_data(
        session=session, 
        url_to_scrape=product_page_url,
        logger=logger,
        header=header)
    logger.info("Parsing product info")
    # get product/book name
    title = response.title.text.split("|")[0].strip()
    # get product description
    description = response.find_all("div", class_="sub-header")[0].find_next('p').text
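    # get product details and extract the full dictionary
    # note: pd.read_html fetches the URL itself, so this is a second request
    # that does not go through the session or send our headers
    all_tables = pd.read_html(product_page_url)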
    data_dict = all_tables[0].set_index(0).to_dict()[1]
    # return the data in the format of UPC, title, description, and price
    return (data_dict['UPC'], title, description, data_dict['Price (incl. tax)'],)
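
The product details could also be read from the page that was already fetched, which would avoid asking the site for the same URL twice. The sketch below is one way to do that with BeautifulSoup alone. The helper name `parse_product_table` is hypothetical, and the sample HTML assumes each table row pairs a `th` label with a `td` value.

```python
from bs4 import BeautifulSoup


def parse_product_table(soup):
    """
    Hypothetical helper: turn rows of the form <tr><th>label</th><td>value</td></tr>
    into a {label: value} dictionary.
    """
    details = {}
    for row in soup.find_all("tr"):
        label, value = row.find("th"), row.find("td")
        # skip rows that do not have both a label and a value
        if label is not None and value is not None:
            details[label.get_text(strip=True)] = value.get_text(strip=True)
    return details


# Small stand-in for a product page (assumed structure, not the real page)
sample_html = """
<table>
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Price (incl. tax)</th><td>£51.77</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
details = parse_product_table(soup)
```

With this approach, `details['UPC']` and `details['Price (incl. tax)']` play the same role as the `data_dict` lookups in the function above.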

Create a main function for orchestrating the logic

def main(user_agent_string, email, path_to_save_logs, product_url_to_scape):
    """
    To orchestrate (i.e. automate in a specific sequence), we can make a main function
    that calls all the other steps
    """ 
    # Start out by initializing the logging capability.
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(f"{path_to_save_logs}{time.strftime('%Y-%m-%d_%H-%M')}.log"),
            logging.StreamHandler(sys.stdout),
        ],
    )
    logger = logging.getLogger(__name__)

    # Log that we are starting!
    logger.info(f"Starting the scrape of the following page: {product_url_to_scape}")

    # Set up the header
    heads = {
        'User-Agent': user_agent_string,
        'email': email,
        'Accept-Language': 'en-US, en;q=0.5',
    }
    session = requests.Session()

    # call the function that scrapes and parses the product page, and return its result
    return get_and_parse_product_page(
        product_page_url=product_url_to_scape, 
        session=session, 
        logger=logger, 
        header=heads)
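
One thing to keep in mind is that `logging.basicConfig` only configures the root logger the first time it is called in a process, so calling `main` several times in one notebook session keeps writing to the first log file. A possible next step, sketched here under assumptions, is to move the logging setup into its own helper. The name `setup_logger` and the logger name `"scraper"` are hypothetical and not part of the code above.

```python
import logging
import sys
import time
from pathlib import Path


def setup_logger(path_to_save_logs, name="scraper"):
    """Return a logger that writes to a timestamped file and to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Attach handlers only once, so repeated calls do not duplicate log lines
    if not logger.handlers:
        log_dir = Path(path_to_save_logs)
        log_dir.mkdir(parents=True, exist_ok=True)
        log_file = log_dir / f"{time.strftime('%Y-%m-%d_%H-%M')}.log"
        formatter = logging.Formatter(
            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
        )
        for handler in (logging.FileHandler(log_file), logging.StreamHandler(sys.stdout)):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

`main` could then call `setup_logger(path_to_save_logs)` and pass the result along as `logger`, which keeps the orchestration function focused on sequencing the steps.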
    

Now let’s try it!

We can try it on two products and see what we get:

main(
    user_agent_string = "ESCAP Webscraping RAP demo scraper 1.0",
    email = "example@email.com",
    path_to_save_logs = "../data/logs/",
    product_url_to_scape = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
)
2024-09-17 21:51:08,776 - __main__ - INFO - Starting the scrape of the following page: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
2024-09-17 21:51:08,777 - __main__ - INFO - Getting data from the site
2024-09-17 21:51:08,972 - __main__ - INFO - Parsing product info
('a897fe39b1053632',
 'A Light in the Attic',
 "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more",
 '£51.77')
main(
    user_agent_string = "ESCAP Webscraping RAP demo scraper 1.0",
    email = "example@email.com",
    path_to_save_logs = "../data/logs/",
    product_url_to_scape = "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
)
2024-09-17 21:51:11,241 - __main__ - INFO - Starting the scrape of the following page: https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
2024-09-17 21:51:11,242 - __main__ - INFO - Getting data from the site
2024-09-17 21:51:11,445 - __main__ - INFO - Parsing product info
('90fa61229261140a',
 'Tipping the Velvet',
 '"Erotic and absorbing...Written with starling power."--"The New York Times Book Review " Nan King, an oyster girl, is captivated by the music hall phenomenon Kitty Butler, a male impersonator extraordinaire treading the boards in Canterbury. Through a friend at the box office, Nan manages to visit all her shows and finally meet her heroine. Soon after, she becomes Kitty\'s "Erotic and absorbing...Written with starling power."--"The New York Times Book Review " Nan King, an oyster girl, is captivated by the music hall phenomenon Kitty Butler, a male impersonator extraordinaire treading the boards in Canterbury. Through a friend at the box office, Nan manages to visit all her shows and finally meet her heroine. Soon after, she becomes Kitty\'s dresser and the two head for the bright lights of Leicester Square where they begin a glittering career as music-hall stars in an all-singing and dancing double act. At the same time, behind closed doors, they admit their attraction to each other and their affair begins. ...more',
 '£53.74')

Perfect. If we look at the .log file created in ../data/logs/, we see:

2024-09-17 21:42:55,351 - __main__ - INFO - Starting the scrape of the following page: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
2024-09-17 21:42:55,352 - __main__ - INFO - Getting data from the site
2024-09-17 21:42:55,571 - __main__ - INFO - Parsing product info
2024-09-17 21:42:59,238 - __main__ - INFO - Starting the scrape of the following page: https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
2024-09-17 21:42:59,239 - __main__ - INFO - Getting data from the site
2024-09-17 21:42:59,385 - __main__ - INFO - Parsing product info

This means that we are logging our operations as well!
