Use Common Crawl to access web data

What is Common Crawl?

Common Crawl is a non-profit organization that crawls the web and freely provides its archives and datasets to the public. The organization was founded in 2007.

The Common Crawl corpus is a petabyte-scale archive of web pages and metadata. The dataset is hosted on Amazon Web Services (AWS) and is freely accessible to the public. New crawls are published regularly, so the archive is continuously refreshed.

It’s the perfect tool for building a dataset for your AI project.

The dataset

The dataset is stored in Amazon S3. You can access it using the S3 API or plain HTTP; S3 access is more efficient and recommended.

The dataset is split into many files, each containing a subset of the web pages.
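
As an illustration, here is a minimal sketch that lists a few files of the dataset through the S3 API, assuming the boto3 library and anonymous (unsigned) access to the public commoncrawl bucket:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client for the public commoncrawl bucket in us-east-1
s3 = boto3.client('s3', region_name='us-east-1', config=Config(signature_version=UNSIGNED))

# List a few objects of the CC-MAIN-2023-50 crawl
response = s3.list_objects_v2(Bucket='commoncrawl', Prefix='crawl-data/CC-MAIN-2023-50/', MaxKeys=5)
for obj in response.get('Contents', []):
    print(obj['Key'])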

How to access the dataset

First, go to https://commoncrawl.org/get-started and select the latest crawl available.

You can download the whole dataset, but you are unlikely to have the storage for it. You don’t need to, though: if you are only interested in a specific part of the dataset, you can download just the pages you need.

To extract a page from the dataset you need:

  • The segment file of the page
  • The offset of the page in the segment file
  • The length of the page in the segment file

This information is stored in the index of the dataset, which records the location of each URL in the archive.

Once you have this information, you can perform an HTTP range request to download the page.

import gzip
import io

import requests

segment_file = 'crawl-data/CC-MAIN-2023-50/segments/1700679100499.43/warc/CC-MAIN-20231203094028-20231203124028-00893.warc.gz'
offset = 225013702
length = 7069

url = f'https://data.commoncrawl.org/{segment_file}'
# Range request: fetch only the bytes of this record
response = requests.get(url, headers={'Range': f'bytes={offset}-{offset+length-1}'})
if response.status_code == 206:
    content = response.content
    # Decompress the data
    with gzip.open(io.BytesIO(content), 'rb') as f:
        content = f.read()
    print(content)
else:
    print(f"Failed to fetch data: {response.status_code}")

How to get the location of a page in the dataset

Use the Common Crawl Index API

The Common Crawl Index API is a service that provides a simple interface to search the Common Crawl corpus. The API allows you to search for web pages that match a specific query.

The API is simple: you pass the URL of the page you are looking for, and the API returns the location of that page in the dataset.

For example, for the page https://commoncrawl.org/faq you can use the following request:

http://index.commoncrawl.org/CC-MAIN-2023-50-index?url=commoncrawl.org%2Ffaq&output=json

You can replace CC-MAIN-2023-50 with the latest crawl available.

{"urlkey": "org,commoncrawl)/faq", "timestamp": "20231203094453", "url": "https://commoncrawl.org/faq", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "E6N62SALJEROKFK4BVRK523WLDBV67RW", "length": "7069", "offset": "225013702", "filename": "crawl-data/CC-MAIN-2023-50/segments/1700679100499.43/warc/CC-MAIN-20231203094028-20231203124028-00893.warc.gz", "languages": "eng", "encoding": "UTF-8"}

The Common Crawl foundation provides a full example on how to use the API and retrieve the page: https://gist.github.com/thunderpoot/58a748565d2e5b2582520fa535821908#file-cc_fetch_page-py

This method is the easiest way to get the location of a specific page in the dataset, but it is not well suited to looking up the locations of a large number of pages.

Use the index files

You can download all the index files of a crawl and search them for the locations of the pages you need.

The file cc-index.paths.gz lists the locations of all the index files, so you can download it and extract their paths. The whole index is around 300 GB compressed.
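
For example, here is a minimal sketch that downloads the paths file of the CC-MAIN-2023-50 crawl and prints the first few index file locations (the path of cc-index.paths.gz is assumed to follow the usual crawl-data/<crawl>/ layout):

import gzip

import requests

# Download the list of index files for the CC-MAIN-2023-50 crawl
paths_url = 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/cc-index.paths.gz'
response = requests.get(paths_url)
response.raise_for_status()

# One index file path per line, relative to https://data.commoncrawl.org/
index_paths = gzip.decompress(response.content).decode().splitlines()
print(len(index_paths), 'index files')
print(index_paths[:3])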

Another version of the index, cc-index-table.paths.gz, is also available. It contains the same data as Apache Parquet files, which can be read by tools like DuckDB, Apache Spark, or Trino. Usage is similar to AWS Athena, described below.
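
As an example, here is a minimal DuckDB sketch in Python; it assumes the Parquet files use a hive-style crawl=/subset= partition layout under s3://commoncrawl/cc-index/table/cc-main/warc/ and that anonymous reads of the public bucket work from your environment (note that it still transfers a significant amount of data over the network):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-east-1'")

# Look up a few pages of trino.io in the CC-MAIN-2023-50 crawl
rows = con.execute("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet(
        's3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2023-50/subset=warc/*.parquet',
        hive_partitioning = true)
    WHERE url_host_name = 'trino.io'
    LIMIT 10
""").fetchall()
print(rows)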

Use AWS Athena

Athena is a Trino-based serverless interactive query service that makes it easy to analyze large amounts of data in Amazon S3 using standard SQL.

Athena is cost-effective and easy to use. There is no need to set up or manage infrastructure, and you only pay for the queries that you run.

You pay $5 per TB of data scanned. If you make good use of the partitioning of the dataset, the cost can be very low: a query that scans 100 GB, for example, costs about $0.50.

Set up Athena

You need to select the us-east-1 (N. Virginia) region to access the Common Crawl dataset.

Open the query editor and create a new database:

CREATE DATABASE ccindex

Next, create the table by running the following query (from https://github.com/commoncrawl/cc-index-table/blob/main/src/sql/athena/cc-index-create-table-flat.sql):

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                   STRING,
  url                           STRING,
  url_host_name                 STRING,
  url_host_tld                  STRING,
  url_host_2nd_last_part        STRING,
  url_host_3rd_last_part        STRING,
  url_host_4th_last_part        STRING,
  url_host_5th_last_part        STRING,
  url_host_registry_suffix      STRING,
  url_host_registered_domain    STRING,
  url_host_private_suffix       STRING,
  url_host_private_domain       STRING,
  url_host_name_reversed        STRING,
  url_protocol                  STRING,
  url_port                      INT,
  url_path                      STRING,
  url_query                     STRING,
  fetch_time                    TIMESTAMP,
  fetch_status                  SMALLINT,
  fetch_redirect                STRING,
  content_digest                STRING,
  content_mime_type             STRING,
  content_mime_detected         STRING,
  content_charset               STRING,
  content_languages             STRING,
  content_truncated             STRING,
  warc_filename                 STRING,
  warc_record_offset            INT,
  warc_record_length            INT,
  warc_segment                  STRING)
PARTITIONED BY (
  crawl                         STRING,
  subset                        STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

At this point the table exists, but if you try a query it will not return any data, because the partitions have not been loaded yet.

You first need to repair the table, which loads the partitions:

MSCK REPAIR TABLE ccindex

You will also need to do this again whenever a new crawl is released.

Run a query

First, you can check that the dataset is correctly loaded by running the following query:

SELECT * FROM ccindex LIMIT 1


When you run a query, pay attention to its cost, which depends on the amount of data scanned.


To reduce the cost, make sure to use the crawl and subset columns in your query. Because the dataset is partitioned by crawl and subset, this reduces the amount of data scanned.

Also use LIMIT to reduce the amount of data returned by the query.

This query returns the location of all the pages of the website trino.io:

SELECT url,
       warc_filename,
       warc_record_offset,
       warc_record_length
FROM ccindex
WHERE crawl = 'CC-MAIN-2023-50'
AND subset = 'warc'
AND url_host_name = 'trino.io'
AND content_mime_type = 'text/html'
LIMIT 1000


You can then export the results as a CSV file and use it to download the pages, as in the sketch below.
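
As a minimal sketch, assuming the results were exported to a file named results.csv with the columns of the query above (url, warc_filename, warc_record_offset, warc_record_length), the pages can be downloaded the same way as before:

import csv
import gzip

import requests

# results.csv is assumed to be the exported Athena result of the query above
with open('results.csv', newline='') as f:
    for row in csv.DictReader(f):
        offset = int(row['warc_record_offset'])
        length = int(row['warc_record_length'])
        record_url = f"https://data.commoncrawl.org/{row['warc_filename']}"
        # Range request for this single record
        response = requests.get(record_url, headers={'Range': f'bytes={offset}-{offset + length - 1}'})
        if response.status_code == 206:
            # Each record is an individually gzipped WARC record
            warc_record = gzip.decompress(response.content)
            print(row['url'], len(warc_record), 'bytes')
        else:
            print(f"Failed to fetch {row['url']}: {response.status_code}")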

Conclusion

Common Crawl is an amazing tool for accessing web data without having to crawl the web yourself. The dataset is huge, and it’s perfect for bootstrapping AI projects.