Applying RAP for price statistics

Overview

As RAP is a set of technical best practices and styles of working, it is worth considering concretely how it applies to price statistics.

Overview of the ADS processing view

Web scraping is the first of several phases necessary to create elementary aggregates, which are the main input and the building blocks of the Consumer Prices Index.

Overview of all the steps

The diagram below provides an overview of the four main steps needed to prepare web-scraped data for the CPI: the web scraping step, the data processing or preparation step, the classification step, and the price index or aggregation step. Each step outputs a dataset that is used as the input for the next step.

Figure 1: High-level overview of the steps to process web-scraped data for the CPI (read from left to right, starting with the scraping process)

Each step is made up of one or more sub-components:

  • The web scraping step contains the scraping itself,1 but also includes dataset validation that helps make sure the scraper is operating as expected;2
  • The data processing step contains the standardization and preparation processes;3
  • The classification step contains several sub-steps, such as identifying unique products to classify, the classification method itself, and the manual validation of classifications (in cases where there are errors);4
  • The price index step involves many sub-steps of its own.5

Note: this is not an authoritative breakdown, as the composition of each step could change. The idea we are trying to demonstrate is that there are several steps that can be developed separately (i.e. loosely coupled), as sketched below.
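
To illustrate what "loosely coupled" can look like in practice, here is a minimal Python sketch in which each step only reads the dataset written by the previous step and writes its own output. The function names, file names, and placeholder logic are hypothetical and not taken from any specific NSO pipeline.

```python
# Sketch of loosely coupled pipeline steps: each step depends only on the
# dataset produced by the previous step, so any step can be re-run,
# replaced, or moved into its own repository without touching the others.
import pandas as pd

def run_scraper(output_path: str) -> None:
    """Scrape the retailer site and write the raw daily dataset."""
    raw = pd.DataFrame({"product": [], "price": []})  # placeholder for real scraping
    raw.to_parquet(output_path)

def process_data(raw_path: str, output_path: str) -> None:
    """Standardize and prepare the raw scraped data."""
    prepared = pd.read_parquet(raw_path)  # placeholder for cleaning/standardization
    prepared.to_parquet(output_path)

def classify_products(prepared_path: str, output_path: str) -> None:
    """Assign each unique product to a consumption category."""
    classified = pd.read_parquet(prepared_path)  # placeholder for classification
    classified.to_parquet(output_path)

def build_index(classified_path: str, output_path: str) -> None:
    """Aggregate classified prices into elementary aggregates."""
    aggregates = pd.read_parquet(classified_path)  # placeholder for index methods
    aggregates.to_parquet(output_path)

if __name__ == "__main__":
    run_scraper("raw.parquet")
    process_data("raw.parquet", "prepared.parquet")
    classify_products("prepared.parquet", "classified.parquet")
    build_index("classified.parquet", "elementary_aggregates.parquet")
```

Because each step communicates only through a dataset on disk, the same sketch works whether the four steps live in one repository or in four separate ones.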

How does RAP come into this?

Given this context, RAP can be applied in several ways. As RAP is a way to encapsulate the creation of a (in our case statistical) process into one corpus (i.e. a repository on GitHub/GitLab with all the necessary documentation and materials to make the process reproducible), we can:

  1. Encapsulate the whole process in one repository. As RAP is meant to help minimize coupling (dependencies) between separate processes – treating the whole process end-to-end as one RAP is useful if no steps in this pipeline have to be shared with other pipelines (such as with other data) or if this specific data always runs end-to-end.
  2. Encapsulate each step into a separate RAP. This may be appropriate if several processing phases are done in sequence with specific targeted stopping points – such as handoff between teams, if some steps are shared with other pipelines, or where manual steps are necessary. The UK RAP implementation shows how several pipelines operate on top of a data architecture in sequence.6
  3. Encapsulate as appropriate. Several steps could be combined to simplify the process where separation is not required, such as the web scraping and data processing steps.

For this guide and for the demo RAP scraper, we will follow approach 2, as this demonstrates RAP for just the scraping component in more detail.

Closer look at the web scraping component

Let’s look a little closer at the web scraping component, as there is more to it than a single scrape script that creates an output dataset and a report. We can portray this visually to flesh out what is involved in this step:

Checking robots.txt

As discussed in the course and in other sources, the legal aspects of scraping are key: NSOs should follow retailers’ guidance on whether they can be scraped or not. This involves checking the Terms and Conditions for the site, but also checking its robots.txt. As the retailer may change its robots.txt, it may be useful to have an automated component that checks it before starting the scraper, so that any new exclusions are taken into account.
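
A minimal sketch of such an automated check, using Python's standard library robots.txt parser, might look like the following. The retailer URL and user-agent string are illustrative only.

```python
# Sketch: automated robots.txt check before the scraper starts.
# ROBOTS_URL and USER_AGENT are hypothetical examples.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://www.example-retailer.com/robots.txt"
USER_AGENT = "nso-price-scraper"

def can_scrape(url: str) -> bool:
    """Return True if the current robots.txt allows this user agent to fetch url."""
    parser = RobotFileParser()
    parser.set_url(ROBOTS_URL)
    parser.read()  # downloads and parses the latest robots.txt
    return parser.can_fetch(USER_AGENT, url)

if __name__ == "__main__":
    target = "https://www.example-retailer.com/groceries/milk"
    if not can_scrape(target):
        raise SystemExit(f"robots.txt now disallows {target}; stopping the scraper.")
```

Running this check at the start of every scheduled scrape means a newly added exclusion stops the scraper before any disallowed requests are made.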

Scraping the site itself

The scraper itself is usually custom to the retailer site and outputs a file containing what was scraped that day. Using RAP principles to develop it well (which we cover in the September 18th session, with examples of how a notebook can be translated into something more RAP friendly) is of course key, so that the scraper is stable and can be fixed quickly when inevitable website changes happen.
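
As an illustration of the "one dated output file per day" pattern, here is a small Python sketch using requests and BeautifulSoup. The URL, CSS selectors, and field names are hypothetical; a real scraper is tailored to the specific retailer's page structure.

```python
# Sketch of a retailer-specific scraper that writes one file per day.
# BASE_URL and the CSS selectors are hypothetical examples.
import datetime as dt
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.example-retailer.com/groceries?page={page}"

def scrape_page(page: int) -> list[dict]:
    """Scrape one listing page and return its products as records."""
    response = requests.get(BASE_URL.format(page=page), timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    records = []
    for item in soup.select("div.product"):  # hypothetical selector
        records.append({
            "name": item.select_one(".name").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
            "scrape_date": dt.date.today().isoformat(),
        })
    return records

def run_scraper(n_pages: int, output_dir: str = ".") -> str:
    """Scrape all listing pages and write a dated output file."""
    records = [rec for page in range(1, n_pages + 1) for rec in scrape_page(page)]
    output_path = f"{output_dir}/scrape_{dt.date.today():%Y%m%d}.csv"
    pd.DataFrame(records).to_csv(output_path, index=False)
    return output_path
```

Keeping the page-level scraping and the file-writing in separate functions makes it easier to unit test the parsing logic and to fix only the affected function when the site changes.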

Logging

Although technically a component of the scraping pipeline itself (according to RAP principles), logging the operation of the scraper is a very valuable part worth pointing out separately. There may be legal or netiquette requirements (such as the robots.txt asking for a specific delay between requests to the website, or your NSO requiring specific processes), as well as technical reasons (such as easier debugging), that make it useful to log key events to a .log file describing the operations of the scraper. Logging can also be combined with the monitoring component to better understand when something goes wrong.
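
A minimal sketch of what this can look like with Python's standard logging module is shown below; the log file name and the crawl delay value are illustrative assumptions.

```python
# Sketch: writing key scraper events to a .log file and respecting a
# crawl delay between requests. Values here are illustrative only.
import logging
import time

logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("scraper")

CRAWL_DELAY_SECONDS = 5  # e.g. the delay requested in robots.txt

def fetch_with_logging(url: str) -> None:
    """Log each request, wait the polite delay, and record failures."""
    logger.info("Requesting %s", url)
    time.sleep(CRAWL_DELAY_SECONDS)
    try:
        pass  # placeholder for the actual request
    except Exception:
        logger.exception("Request to %s failed", url)
        raise
    logger.info("Finished %s", url)
```

The resulting scraper.log file gives a timestamped record of what the scraper did on a given day, which is exactly the material the monitoring component can build on.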

Monitoring

As a retailer can choose to change their website at any time, monitoring that the intended data was collected is key. Logs or error handling in the scraper component may not capture all changes to the site, for example if the retailer stops listing some products on their website. Such changes can go unnoticed by the previous checks, yet the resulting scraped data may no longer have the coverage the NSO expects, which would affect the statistics calculated from it. NSOs therefore tend to stand up monitoring as an extra layer of validation. This check is similar in nature to the checks NSOs typically run on scanner data that is received regularly.7
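
One simple form such monitoring can take is comparing today's scrape against a recent baseline, as in the Python sketch below. The column names, threshold, and file paths are hypothetical assumptions, not a prescribed method.

```python
# Sketch: coverage monitoring that compares today's scrape with a baseline
# scrape. Column names and the 10% threshold are illustrative only.
import pandas as pd

def check_coverage(today_path: str, baseline_path: str,
                   max_drop: float = 0.10) -> list[str]:
    """Flag product categories whose scraped row count fell by more than max_drop."""
    today = pd.read_csv(today_path).groupby("category").size()
    baseline = pd.read_csv(baseline_path).groupby("category").size()
    warnings = []
    for category, expected in baseline.items():
        observed = today.get(category, 0)
        if observed < expected * (1 - max_drop):
            warnings.append(
                f"{category}: {observed} products scraped, expected around {expected}"
            )
    return warnings

if __name__ == "__main__":
    for warning in check_coverage("scrape_today.csv", "scrape_baseline.csv"):
        print(warning)
```

Checks like this catch the quiet failures: the scraper ran without errors, but part of the expected product range is simply no longer there.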


Footnotes

  1. See Practical guidelines on web scraping for the HICP (2020), specifically Annex I and VII, for more details.↩︎

  2. See Monitoring, validation and plausibility checks for web scraped data (UN e-Handbook) for more details↩︎

  3. See Preparation of data (UN e-Handbook) for more details.↩︎

  4. For an overview of the classification process in production and other operational aspects, see “Classification of Alternative Data Sources”, 2024-05-14. For an overview of the 5 main classification methods, see “Classifying Alternative Data Sources for Consumer Prices Statistics: Methods and best practices”, 2023-06-08.↩︎

  5. See “Practical guidelines on web scraping for the HICP (2020)” for more details.↩︎

  6. See Price (2023), Developing reproducible analytical pipelines for the transformation of consumer price statistics: rail fares, for more details; for instance, section 5.2 outlines pipeline tables and how different pipelines interact over a specific data architecture.↩︎

  7. Guðmundsdóttir and Jónasdóttir (2016) Scanner Data: Initial Testing provide a useful overview that could be similar in the web scraping case.↩︎