Common Crawl is a nonprofit organization that crawls the web and freely provides its archives for research, with the goal of democratizing access to web data.
To your questions:
Are the snapshots independent? The data sets provided by Common Crawl are essentially snapshots of the web at the time they were crawled. There is some overlap between snapshots because of the nature of the web (the same URLs may be crawled in different crawling periods), but each snapshot can be used on its own. You could theoretically use a single snapshot to power a search engine, though the results would only be as recent as that snapshot; using multiple snapshots gives you a wider breadth of data and more recent information.
That is how, for instance, a filtered version of Common Crawl was used to train OpenAI's GPT-3 language model.
How much of the data is duplicate? Given the nature of web crawling, there is likely to be a significant amount of duplication in the data. This is because the same URLs might be crawled multiple times over different periods, and the content on those URLs might not have changed significantly. In addition, many web pages contain similar or identical content, such as headers, footers, and navigation menus.
How much can the data be reduced by stripping HTML tags and other non-essential data? Stripping HTML tags and other non-essential data can significantly reduce the size of the data. However, it is difficult to give a precise estimate without analyzing the data. In general, the text content of a web page is typically a small fraction of the total data, especially for pages that make heavy use of images, videos, and other media.
You can see in parse_cc_index.py that stripping HTML tags is not the only pre-processing applied to that type of data: it includes a def remove_html_tags(text) method, but it also removes special characters, non-printable characters, ...
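As a rough illustration of that kind of cleanup (a minimal sketch only; the actual parse_cc_index.py may differ, and clean_text here is a hypothetical wrapper around the tag-stripping step):

```python
import re

def remove_html_tags(text: str) -> str:
    """Strip anything that looks like an HTML/XML tag with a minimal regex."""
    return re.sub(r"<[^>]+>", " ", text)

def clean_text(text: str) -> str:
    """Hypothetical wrapper showing the kind of extra cleanup described above."""
    text = remove_html_tags(text)
    # Drop non-printable characters.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Replace remaining special characters with spaces, keeping word
    # characters and basic punctuation.
    text = re.sub(r"[^\w\s.,;:!?'-]", " ", text)
    # Collapse the runs of whitespace left over from the substitutions.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("<p>Hello <b>world</b>!</p>"))  # -> "Hello world !"
```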
What are the odds that a webpage from the New York Times present in the March 2015 payload has since appeared in multiple payloads? This is difficult to answer without analyzing the data, but it is very likely that a popular URL like a New York Times article would be crawled multiple times across crawl periods. The extent of the duplication depends on a variety of factors, including the frequency of the crawls, their breadth (i.e., how many URLs are crawled), and the specifics of the crawling algorithm (i.e., how it decides which URLs to crawl).
When building a search engine using Common Crawl data, it would be necessary to have a strategy for dealing with duplicate content. This could involve a variety of techniques, such as deduplication (i.e., removing identical or near-identical documents), clustering (i.e., grouping similar documents together), and ranking (i.e., deciding which documents to show first in the search results). These techniques can help ensure that the search engine provides relevant and diverse results, rather than simply returning the same content over and over again.
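For exact duplicates, a content fingerprint is often enough. Here is a minimal sketch, assuming documents are dicts with url and text fields (near-duplicate detection would need something like MinHash or SimHash, not shown):

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Hash of whitespace-normalized text; identical pages collide on purpose."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep only the first document seen for each content fingerprint."""
    seen = set()
    for doc in docs:
        fp = content_fingerprint(doc["text"])
        if fp not in seen:
            seen.add(fp)
            yield doc

docs = [
    {"url": "https://example.com/a", "text": "Breaking news: something happened."},
    {"url": "https://example.com/a?ref=rss", "text": "Breaking  news: something happened."},
]
print([d["url"] for d in deduplicate(docs)])  # only the first URL survives
```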
The WET files seem to have the stripped-down content from HTML - exactly what I think I need.
Assuming that CC WET file data is used in Elasticsearch, what RAM/disk combination would be optimal per ES node to get good overall performance on the entire cluster?
I once read that 64 GB of RAM per machine is recommended, but is there any guidance on how much ES storage that machine should hold in primary shards (let's assume 1 replica shard per machine) for optimal performance, beyond which data should spill over to the next machine?
That would involve Elasticsearch optimization, as illustrated in "Retrieving and indexing a subset of Common Crawl domains with Spark and ElasticSearch, on Azure" (Spiros Politis, Oct. 2020).
Said optimizations would include the following criteria; each is followed by its immediate impact (marked with "=>") on the optimal RAM/disk combination per ES node for good overall cluster performance.
Memory: Elasticsearch performs best when it can keep the entire index (or as much of it as possible) in memory. This is because Elasticsearch, like other Lucene-based search engines, relies heavily on the file system cache of the operating system to ensure fast access to frequently requested data. As such, it is typically recommended to have as much memory as you can afford, with a general guideline of at least 64GB of RAM per node.
=> More RAM allows for a larger portion of the index to be cached in memory, which can significantly speed up query performance by reducing the need for disk I/O operations.
Storage: The amount of storage will be dictated by the size of your data. Elasticsearch recommends SSDs over spinning disks for their superior I/O performance. The storage should be several times larger than the size of your data, to allow for backups, reindexing, and growth over time.
=> Fast, high-capacity storage (like SSDs) can reduce the time it takes for Elasticsearch to read from and write to disk, which can be particularly beneficial for indexing and search operations that cannot be served from memory.
Sharding: The number of primary shards for an index should be determined based on the amount of data in the index. As a general rule, you want your shard size to be between a few GB and a few tens of GB, and definitely under 100GB; beyond that you start to see diminishing returns in performance.
This means if you expect your index to be 1TB in size and you want your shard size to be around 50GB, you'd want around 20 primary shards.
=> Proper sharding can ensure data is evenly distributed across nodes, which can improve query performance by allowing multiple nodes to work on a query in parallel.
Optimal shard size ensures efficient utilization of resources (RAM, CPU, I/O) in each node, thus enhancing the overall performance of the ES cluster.
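Putting the shard-count arithmetic above into code, here is a sketch using the official Elasticsearch Python client (the body= style of the 7.x API; the index name cc_wet, the cluster address, and the size figures are all assumptions):

```python
import math
from elasticsearch import Elasticsearch  # official Python client assumed

EXPECTED_INDEX_SIZE_GB = 1000   # ~1 TB of WET text, an assumed figure
TARGET_SHARD_SIZE_GB = 50       # within the "few GB to a few tens of GB" guideline

num_primary_shards = math.ceil(EXPECTED_INDEX_SIZE_GB / TARGET_SHARD_SIZE_GB)  # -> 20

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="cc_wet",  # hypothetical index name
    body={
        "settings": {
            "number_of_shards": num_primary_shards,  # fixed at index creation time
            "number_of_replicas": 1,                 # one replica copy per primary
        }
    },
)
```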
Shard allocation: Elasticsearch has the ability to control where shards are placed through various allocation settings.
For example, you can specify that no two copies of the same shard will be placed on the same physical machine, to protect against hardware failure.
You can also control shard balancing across nodes, to prevent any one node from becoming a hotspot.
=> Proper shard allocation can prevent any one node from becoming a bottleneck, ensuring that the cluster can effectively distribute load and make the most of the available hardware.
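For example, assuming the Python client and the hypothetical cc_wet index from above, the host-awareness setting and a per-node shard cap look roughly like this:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Forbid two copies of the same shard from landing on the same physical host
# (relevant when several nodes run on one machine).
es.cluster.put_settings(body={
    "persistent": {"cluster.routing.allocation.same_shard.host": True}
})

# Cap how many shards of this index a single node may hold, to spread load
# and avoid hotspots; the value 2 is just an illustrative choice.
es.indices.put_settings(index="cc_wet", body={
    "index.routing.allocation.total_shards_per_node": 2
})
```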
Heap Size: The JVM heap size should be no more than 50% of your total memory, and absolutely no more than 32GB. This is because Java uses compressed object pointers for heap sizes less than ~32GB. Using more than 50% of your memory for the heap can cause excessive swapping, as it leaves less memory available for the operating system to use for the file system cache.
=> A properly sized JVM heap can prevent memory issues like garbage collection pauses or swapping, which can significantly impact performance.
Setting the JVM heap size to no more than 50% of your total memory ensures that there is sufficient memory for the operating system to use for the file system cache, which Elasticsearch relies on for fast data access.
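The heap itself is set via -Xms/-Xmx in jvm.options; the arithmetic behind the rule above can be sketched as:

```python
def recommended_heap_gb(total_ram_gb: int) -> int:
    """At most half of total RAM, and never more than 32GB (the ~32GB
    compressed-object-pointer threshold mentioned above)."""
    return min(total_ram_gb // 2, 32)

for ram in (32, 64, 128):
    heap = recommended_heap_gb(ram)
    print(f"{ram} GB RAM -> -Xms{heap}g -Xmx{heap}g, {ram - heap} GB left for the OS cache")
```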
Remember, these points are interconnected.
For example, having a lot of memory is less beneficial if your storage is slow, as it will take a long time to load data into memory.
Similarly, fast storage is less beneficial if you do not have enough memory to cache frequently accessed data.
Therefore, it is important to consider all these points together when deciding on your hardware setup.
In terms of how to distribute data across machines, you would ideally want to add nodes as your data grows such that you maintain a balanced shard size across your cluster.
For example, if each node can comfortably handle 1TB of data and you expect your data to grow to 2TB, you'd want to add a second node when your data gets close to 1TB.
Given that you mentioned a system with 64 GB of RAM, let's start by setting the heap size: as per Elasticsearch's guidelines, you should set the JVM heap size to be no more than 50% of your total memory, and absolutely no more than 32GB. So, in this case, you would set the heap size to 32GB. This leaves the remaining 32GB for the operating system and file system cache.
Regarding the amount of storage that this machine should support for optimal performance, it depends on various factors, but a rule of thumb is to keep the shard size between a few GB and a few tens of GB, and definitely under 100GB.
This suggests that with the given memory, assuming your machine can handle multiple shards, your storage should be able to handle hundreds of GB to a few TB of data.
For example, if you have 20 primary shards per machine, and each shard is 50GB, you would need 1TB of storage.
This does not account for replicas (which you should have for redundancy), growth over time, or the space needed for Elasticsearch to perform operations like segment merging.
As such, you might want to double or even triple that number to give yourself some breathing room.
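As a back-of-the-envelope sketch of that estimate (the shard count and safety factor are the assumed figures from above):

```python
shards_per_node = 20          # primary shards hosted on this node (from the example above)
target_shard_size_gb = 50     # per-shard size target
safety_factor = 2             # double (or triple) for replicas, growth, segment merging

primary_gb = shards_per_node * target_shard_size_gb   # 1000 GB = 1 TB of primary data
recommended_disk_gb = primary_gb * safety_factor      # 2000 GB; use 3x for more headroom

print(f"Provision roughly {recommended_disk_gb / 1000:.0f} TB of disk on this node")
```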
Again, this is a very rough estimation and the actual optimal setup can vary based on your specific use case and workload.
You may need to adjust these recommendations based on factors such as your data size, query volume, and the complexity of your queries.
You should always monitor your cluster performance and adjust your setup as necessary based on what you observe.
I read somewhere (cannot recall where right now) that a node size of 3-4 TB (divided among multiple shards) works well with 64 GB RAM. Is that a good approximation?
Yes, you can find a similar estimation in this presentation: "Experiences in ELK with D3.js for Large Log Analysis and Visualization".
Is RAM more related to the number of primary shards and their per-shard size, or to the total number of shards and their combined size on a machine, given that RAM is allocated to the machine as a whole and must serve all shards together for best performance?
The use of RAM in Elasticsearch is related to both the total size of the data and the number of shards.
Meaning:
Total Size of Data: A larger total size of data means more information to index and search. The more of this data that can be kept in the system's memory (as part of the file system cache), the faster Elasticsearch can access it, which improves query performance.
Number of Shards: Each shard in Elasticsearch is a separate Lucene index, which has its own set of resources including memory.
Having more shards can increase the memory overhead, as each shard has some level of fixed overhead. This overhead includes things like the data structures used for indexing and searching, as well as the resources needed for Elasticsearch to manage the shard.
Too many shards can lead to a waste of resources because of this overhead, which can in turn lead to performance problems.
See "Size your shards" for more.
So, RAM is related to both the total size of the data on the machine (because more data requires more memory to cache) and the number of shards (because more shards require more memory for overhead).
Balancing these two aspects is crucial for achieving optimal performance.
In your case, with 64 GB RAM, it would be beneficial to ensure that the total size of your data (taking into consideration both primary and replica shards) is manageable within this memory limit, and that you are not maintaining more shards than necessary.
The exact numbers can depend on the specifics of your workload and data, and you may need to adjust based on the performance you observe.
Will the difference between 64 GB and 32 GB RAM be significant enough?
If so, where will the difference hit the most: runtime search performance, or indexing performance when more pages are added to the ES store?
The difference between 64GB and 32GB of RAM can indeed be significant, and the impact will be felt mostly in:
Search Performance: More memory can significantly improve search performance because Elasticsearch relies heavily on the operating system's file system cache to quickly access frequently requested data.
With more RAM, more of the index can be kept in memory, reducing the need for slow disk I/O operations.
So, if your use case involves complex queries or you need to maintain low-latency responses, the additional memory can be beneficial.
In general, you should make sure that at least half the available memory goes to the filesystem cache so that Elasticsearch can keep hot regions of the index in physical memory.
Indexing Performance: More memory can also improve indexing performance. Indexing involves creating and updating the data structures that Elasticsearch uses to perform searches. These operations can be memory-intensive, particularly if you are indexing a large amount of data at once.
With more RAM, these operations can be performed more quickly and efficiently.
Overall, if your dataset and query volume can comfortably fit within 32GB of RAM, then you may not see a significant performance difference with 64GB. But if your dataset or query volume is larger, or you expect it to grow in the future, then the additional memory can provide a significant performance boost.
Specifically, if there is indeed a runtime performance hit for searching (say searches become 0.3 seconds slower than they would be with 64 GB RAM), will that be evident for all searches all the time, or only once the search rate becomes high, say 10,000 queries per hour?
If you see a 0.3 second slowdown in search performance when going from 64GB to 32GB of RAM, it might not be evident for all searches all the time. It could be more noticeable for complex queries, or during periods of high query rate. Additionally, if your data grows over time, you might start to see a performance impact even at lower query rates.
An actual answer will involve monitoring your system's performance and adjusting your setup as necessary based on what you observe.
Remember that hardware is just one piece of the puzzle - you should also consider other factors like your data model, your queries, and your Elasticsearch configuration.
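As a starting point for that monitoring, here is a small sketch with the Python client that pulls heap usage and free disk per node (the cluster address is assumed, and which metrics matter most will depend on your workload):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Per-node JVM and filesystem statistics.
stats = es.nodes.stats(metric="jvm,fs")
for node_id, node in stats["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    disk_free_gb = node["fs"]["total"]["available_in_bytes"] / 1e9
    print(f'{node["name"]}: heap {heap_pct}% used, {disk_free_gb:.0f} GB disk free')
```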