
I have a problem that I have not been able to solve. I have 4 .txt files, each between 30 and 70 GB. Each file contains n-gram entries as follows:

blabla1/blabla2/blabla3
word1/word2/word3
...

What I'm trying to do is count how many times each item appears, and save this data to a new file, e.g.:

blabla1/blabla2/blabla3  : 1
word1/word2/word3        : 3
...

My attempt so far has been simply to save all entries in a dictionary and count them, i.e.

from collections import defaultdict

entry_count_dict = defaultdict(int)
with open(file) as f:
    for line in f:
        entry_count_dict[line] += 1

However, using this method I run into memory errors (I have 8GB of RAM available). The data follows a Zipfian distribution, i.e. the majority of the items occur only once or twice. The total number of entries is unclear, but a (very) rough estimate is that there are somewhere around 15,000,000 entries in total.

In addition to this, I've tried h5py, where each entry is saved as an h5py dataset containing the array [1], which is then updated, e.g.:

import h5py
import numpy as np

entry_count_file = h5py.File(filename, "a")
with open(file) as f:
    for line in f:
        if line in entry_count_file:
            entry_count_file[line][0] += 1
        else:
            entry_count_file.create_dataset(line,
                                            data=np.array([1]),
                                            compression="lzf")

However, this method is way too slow; the writing speed gets slower and slower as more entries are added. Unless the writing speed can be increased, this approach is not feasible. Also, processing the data in chunks and opening/closing the h5py file for each chunk did not show any significant difference in processing speed.

I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with "a" are saved in a.txt, and so on (this should be doable using defaultdict(int)). However, to do this the file would have to be iterated over once for every letter, which is impractical given the file sizes (max = 69GB). Perhaps, while iterating over the file, one could open a pickle, save the entry in a dict, and then close the pickle. But doing this for each item slows the process down quite a lot due to the time it takes to open, load and close the pickle file.

One way of solving this would be to sort all the entries in one pass, then iterate over the sorted file and count the entries alphabetically. However, even sorting the file is painfully slow using the Linux command:

sort file.txt > sorted_file.txt

And I don't really know how to solve this in Python, given that loading the whole file into memory for sorting would cause memory errors. I have some superficial knowledge of different sorting algorithms, but they all seem to require that the whole object to be sorted is loaded into memory.

Any tips on how to approach this would be much appreciated.

00sdf0
    I don't think any Python solution you'd come up with will ever be faster than `sort ngrams.txt | uniq -c`, so just stick with it – Nils Werner Jul 02 '18 at 09:27
  • There are several modules available in Python which can be used to implement a dictionary whose entries are saved or cached on disk instead of in memory. Look at [this question](https://stackoverflow.com/questions/226693/python-disk-based-dictionary) for example. Also consider [searching Google for key/value stores](https://www.google.com/search?q=disk+cached++key+value+store) and looking for one that fits your needs, i.e. one that does most of the work in memory as long as the data fits, but uses disk when your "dictionary" is too big to fit into memory. – running.t Jul 02 '18 at 09:35

2 Answers


There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
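For concreteness, here is a minimal external-sort-and-count sketch in plain Python; the chunk size and the use of unnamed temporary files are illustrative assumptions, not tuned values:

import heapq
import itertools
import tempfile

def sorted_chunk_files(path, chunk_size=2_000_000):
    # Read the input in fixed-size chunks of lines, sort each chunk in
    # memory, and spill it to a temporary file.
    with open(path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_size))
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.TemporaryFile(mode="w+t")
            tmp.writelines(chunk)
            tmp.seek(0)
            yield tmp

def count_sorted(path, out_path):
    # heapq.merge streams the sorted chunks without loading them fully,
    # and groupby counts runs of identical lines in the merged stream.
    chunks = list(sorted_chunk_files(path))
    with open(out_path, "w") as out:
        for line, group in itertools.groupby(heapq.merge(*chunks)):
            out.write("%s : %d\n" % (line.rstrip("\n"), sum(1 for _ in group)))

Each chunk is sorted in memory and spilled to disk, heapq.merge then streams the sorted chunks, and itertools.groupby counts runs of identical lines, so the full data set is never held in memory at once. This is essentially what `sort file.txt | uniq -c` from the comments does, just staying inside Python.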

What you did there with "saving entries which start with certain letters in separate files" is actually called bucket sort, which should, in theory, be faster. Try it with sliced data sets.

Or try Dask, a DARPA- and Anaconda-backed distributed computing library with interfaces familiar from numpy and pandas, which works much like Apache Spark. It runs on a single machine too, and it scales.

I suggest trying dask.array, which cuts a large array into many small ones and implements the numpy ndarray interface with blocked algorithms, so all of your cores are used when computing on larger-than-memory data.
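Since the entries here are text lines rather than numeric arrays, dask.bag is arguably the more natural fit than dask.array for this particular job; a rough sketch (the file patterns, block size and output format below are placeholder assumptions):

import dask.bag as db

# Read the input lazily in blocks, count identical lines across all files,
# and write the (ngram, count) pairs back out as text, one file per partition.
lines = db.read_text("input/*.txt", blocksize="64MB")
counts = lines.map(str.strip).frequencies()
counts.map(lambda kv: "%s : %d" % kv).to_textfiles("output/counts-*.txt")

frequencies() aggregates partial counts per partition and then combines them, so the peak memory use is roughly proportional to the number of unique entries rather than to the size of the input files.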

skywalker

> I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with "a" are saved in a.txt, and so on (this should be doable using defaultdict(int)). However, to do this the file would have to be iterated over once for every letter, which is impractical given the file sizes (max = 69GB).

You are almost there with this line of thinking. What you want to do is to split the file based on a prefix - you don't have to iterate once for every letter. This is trivial in awk. Assuming your input files are in a directory called input:

mkdir output
awk '/./ {print $0 > ("output/" substr($0,1,1))}' input/*

This will append each line to a file named with the first character of that line (note this will be weird if your lines can start with a space; since these are ngrams I assume that's not relevant). You could also do this in Python but managing the opening and closing of files is somewhat more tedious.
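For reference, a rough Python equivalent of the awk command above might look like this (it keeps one output file handle open per first character; the paths mirror the example, and no attempt is made to handle characters that are not valid in file names):

import glob
import os

os.makedirs("output", exist_ok=True)

handles = {}  # one open output file per first character
for path in glob.glob("input/*"):
    with open(path) as f:
        for line in f:
            if line == "\n":
                continue  # skip blank lines, roughly awk's /./ filter
            first = line[0]
            if first not in handles:
                handles[first] = open(os.path.join("output", first), "a")
            handles[first].write(line)

for handle in handles.values():
    handle.close()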

Because the files have been split up they should be much smaller now. You could sort them but there's really no need - you can read the files individually and get the counts with code like this:

from collections import Counter

ngrams = Counter()
with open(filename) as f:      # filename is one of the bucket files in output/
    for line in f:
        ngrams[line.strip()] += 1

for key, val in ngrams.items():
    print(key, val, sep='\t')

If the files are still too large you can increase the length of the prefix used to bucket the lines until the files are small enough.
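If you do go that route, only the bucketing key in the splitting step changes; for example, bucketing on the first two characters in the Python sketch above:

# same loop body as in the splitting sketch, but bucket on a two-character prefix
prefix = line[:2]
if prefix not in handles:
    handles[prefix] = open(os.path.join("output", prefix), "a")
handles[prefix].write(line)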

polm23