
I have a csv file that has 200 million lines in it.

The best way to load this file is line by line using a csv reader, because I have many of these files and parallelising the code later means I can't load in massive datasets and overload RAM.

I'm trying to count the number of occurrences of values in a certain column and record both their values and frequency in a dictionary. For example, count the number of unique IDs in a column and the number of times those IDs appear.

Here is an example of how I would do this:

import csv
from tqdm import tqdm

field_names = ['A','B','ID','C','D']
filename = '/PATH/file'

ID_dict = {}
with open(filename) as f:
    reader = csv.DictReader(f,field_names,delimiter=',')
    for row in tqdm(reader):
        label = row['ID']
        if label not in ID_dict.keys():
                ID_dict[label] = 0
        ID_dict[label] += 1

So the column labelled 'ID' is what I'm interested in here, but imagine it has ~200 million entries in it.

Iterating through all those lines and populating a dictionary is slow (it would take around 10 hours on my machine).

Alternatively, appending the values to a list and then using Counter to find the number of appearances of each unique element also takes too long (see How do I count unique values inside a list).
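For reference, a rough sketch of that list-plus-Counter variant, using the same placeholder path and field names as in the example above:

import csv
from collections import Counter

values = []
with open(filename) as f:   # filename and field_names as defined in the example above
    reader = csv.DictReader(f, field_names, delimiter=',')
    for row in reader:
        values.append(row['ID'])

ID_dict = Counter(values)   # still far too slow at ~200 million rows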

Is there a much faster way of doing this that I'm missing? Perhaps there is a Pandas way that is faster? Thanks in advance

user8188120

3 Answers

3

Don't use DictReader(). DictReader() does a lot of work converting a row to a dictionary, with configurable handling of missing and extra columns, that you really don't need here. Just use a regular reader and access the 3rd column of each row.

You can further speed this up by using a Counter() object to begin with (it'll automatically handle the 0 case for you). You may be able to get a very minor speed boost by opening the file with newline=''; the CSV module recommends you do this anyway as it wants to make sure it knows about line endings versus possible embedded newlines in columns.
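Put together, a minimal sketch of just those two changes (a plain csv.reader() feeding a Counter()) could look like this; the path is the placeholder from your question, and the ID is assumed to be the third column, as in your field_names:

import csv
from collections import Counter

id_counts = Counter()
# newline='' as the csv module recommends; a plain reader avoids the
# per-row dictionary building that DictReader() does.
with open('/PATH/file', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        id_counts[row[2]] += 1   # 'ID' is the third column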

If you use a map() object and operator.itemgetter(), you can further avoid the evaluation loop overhead, and pass the ids directly to the counter:

import csv
from collections import Counter
from operator import itemgetter

filename = '/PATH/file'

with open(filename, newline='') as f:
    reader = csv.reader(f)
    id_counts = Counter(map(itemgetter(2), reader))

Still, 200 million rows is a lot of work to handle. I generated 1 million rows of semi-realistic data using Faker, copied those rows 200 times into a new file, and my 2017 model MacBook Pro with SSD processed the resulting 12GB of data in just over 6 minutes with tqdm included, and 5 minutes 14 seconds without. tqdm claims it adds only 60 nanoseconds to each iteration (12 seconds over 200 million rows), but in my tests it appears to be easily 3 or 4 times that number.

Pandas won't read the data much faster; its read_csv() has an optimised C parser, but it still has to parse all 200 million rows, so reading is in the same ballpark as the above. Moreover, it'll then build a dataframe for those 200 million rows, and that'll take a significant amount of memory to process. You'd have to process your data in chunks and aggregate the results for this to be feasible at all.
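A rough sketch of such chunked aggregation, assuming the field names from your question and an arbitrary chunk size of one million rows:

import pandas as pd
from collections import Counter

field_names = ['A', 'B', 'ID', 'C', 'D']

id_counts = Counter()
# chunksize makes read_csv() yield dataframes of at most that many rows,
# so only one chunk has to fit in memory at a time; the per-chunk
# value_counts() results are folded into a single Counter.
for chunk in pd.read_csv('/PATH/file', names=field_names, chunksize=1_000_000):
    id_counts.update(chunk['ID'].value_counts().to_dict())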

Let's do some speed tests, comparing your version (once with and once without the tqdm overhead), Pandas, and the Counter()-over-map() approach above. We'll use a test set of 10k rows with about 100 unique IDs to compare things evenly, without any disk I/O, so this tests just the counting capabilities of each approach. Setting up the test data and tests (the name=name keyword arguments help avoid global name lookups in the repeated test runs):

>>> import csv, pandas
>>> from timeit import Timer
>>> from collections import Counter
>>> from contextlib import redirect_stderr
>>> from io import StringIO
>>> from operator import itemgetter
>>> from random import randrange
>>> from tqdm import tqdm
>>> row = lambda: f",,{randrange(100)},,\r\n"  # 5 columns, only care about middle column
>>> test_data = ''.join([row() for _ in range(10 ** 4)])  # CSV of 10.000 rows
>>> field_names = ['A', 'B', 'ID', 'C', 'D']
>>> filename = '/PATH/file'
>>> tests = []
>>> def as_test(f):
...     tests.append((f.__name__, f))
...
>>> @as_test
... def in_question(f, csv=csv, tqdm=tqdm, field_names=field_names):
...     ID_dict = {}
...     reader = csv.DictReader(f, field_names, delimiter=',')
...     for row in tqdm(reader):
...         label = row['ID']
...         if label not in ID_dict.keys():
...                 ID_dict[label] = 0
...         ID_dict[label] += 1
...
>>> @as_test
... def in_question_no_tqdm(f, csv=csv, tqdm=tqdm, field_names=field_names):
...     ID_dict = {}
...     reader = csv.DictReader(f, field_names, delimiter=',')
...     for row in reader:
...         label = row['ID']
...         if label not in ID_dict.keys():
...                 ID_dict[label] = 0
...         ID_dict[label] += 1
...
>>> @as_test
... def pandas_groupby_count(f, pandas=pandas, field_names=field_names):
...     df = pandas.read_csv(f, names=field_names)
...     grouped_counts = df.groupby('ID').count()
...
>>> @as_test
... def pandas_value_counts(f, pandas=pandas, field_names=field_names):
...     df = pandas.read_csv(f, names=field_names)
...     counts = df['ID'].value_counts()
...
>>> @as_test
... def counter_over_map(f, csv=csv, Counter=Counter, ig2=itemgetter(2)):
...     reader = csv.reader(f)
...     id_counts = Counter(map(ig2, reader))
...

and then running the timed tests:

>>> for testname, testfunc in tests:
...     timer = Timer(lambda s=StringIO, t=test_data: testfunc(s(t)))
...     with redirect_stderr(StringIO()):  # silence tqdm
...         count, totaltime = timer.autorange()
...         print(f"{testname:>25}: {totaltime / count * 1000:6.3f} milliseconds ({count:>2d} runs)")
...
              in_question: 33.303 milliseconds (10 runs)
      in_question_no_tqdm: 30.467 milliseconds (10 runs)
     pandas_groupby_count:  5.298 milliseconds (50 runs)
      pandas_value_counts:  5.975 milliseconds (50 runs)
         counter_over_map:  4.047 milliseconds (50 runs)

The combination of a DictReader() and a Python for loop is what really causes your version to be 7 to 8 times as slow. tqdm's overhead has dropped to about 0.3 microseconds per iteration with stderr suppressed; dropping the with redirect_stderr() context manager makes the output more verbose and increases the time to 50 milliseconds, so about 2 microseconds per iteration:

>>> timer = Timer(lambda s=StringIO, t=test_data: tests[0][1](s(t)))
>>> count, totaltime = timer.autorange()
10000it [00:00, 263935.46it/s]
10000it [00:00, 240672.96it/s]
10000it [00:00, 215298.98it/s]
10000it [00:00, 226025.18it/s]
10000it [00:00, 201787.96it/s]
10000it [00:00, 202984.24it/s]
10000it [00:00, 192296.06it/s]
10000it [00:00, 195963.46it/s]
>>> print(f"{totaltime / count * 1000:6.3f} microseconds ({count:>2d} runs)")
50.193 microseconds ( 5 runs)

Pandas keeps up well here, however. But without chunking, reading all 200 million rows of data into memory (with the actual dataset, not the empty columns I produced here) needs gigabytes of RAM, is going to be a lot slower, and is perhaps not something your machine can actually carry. Using a Counter() doesn't require gigabytes of memory here.

If you need to do more processing of the CSV dataset, then using SQLite is going to be a good idea too. I wouldn't even use Python then; just use the SQLite command line tool to import the CSV data directly:

$ sqlite3 csvanalysis.db
SQLite version 3.19.3 2017-06-27 16:48:08
Enter ".help" for usage hints.
sqlite> CREATE TABLE csvdata (A, B, ID, C, D);
sqlite> CREATE INDEX csvdata_id on csvdata (ID);
sqlite> .mode csv
sqlite> .import /path/to/file.csv csvdata
sqlite> SELECT ID, COUNT(*) FROM csvdata GROUP BY ID;

etc.

Martijn Pieters
  • Wow, thanks so much that was a really helpful comparison! I agree pandas seems like a good idea but having to preload the entire dataset using value_counts() just isn't worth it if the user wants to work in parallel loading many of these files at once..the memory requirement would be just too large. – user8188120 Nov 29 '18 at 14:34
  • Is there an easy adaptation of this method where the function doesn't count the number of entries per key but rather calculates an ongoing sum of the values per key? – user8188120 Nov 29 '18 at 15:17
  • @user8188120: what values are we talking about? You could use a `summed = defaultdict(int)` dictionary to collect sums per id, then `summed[row[2]] += int(row[some_other_column_index])` to keep such a running total per ID. – Martijn Pieters Nov 29 '18 at 15:26
  • Using defaultdict(int) returns an error for me... I meant if you had values in column labelled 'B' that you wanted to keep a running total for but each 'ID' has its own resulting total at the end. So reading in the csv line by line, the function would sum the values of 'B' for every unique 'ID' and by the end you would have a dictionary much like you do with Counter, but where rather than the values being 'number of occurrences', they are totals instead. Which I think is what you're saying as well – user8188120 Nov 29 '18 at 15:35
  • Ah I think you're saying instantiate summed as summed = {} yes? – user8188120 Nov 29 '18 at 15:40
  • @user8188120: `defaultdict(int)` is not the problem if you are having errors. I can't help you very well in comments, it sounds as if you may want to post a new question. – Martijn Pieters Nov 29 '18 at 15:51
  • I've made a new question based on this topic -if you're interested you can find it here - https://stackoverflow.com/questions/53543951/calculate-running-total-from-csv-line-by-line – user8188120 Nov 29 '18 at 16:54
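A minimal sketch of the running-total idea from the comments above; it assumes the values to sum are in column 'B' (index 1), as the follow-up describes, and that they are integers:

import csv
from collections import defaultdict

totals_per_id = defaultdict(int)   # missing IDs start at 0 automatically
with open('/PATH/file', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        # running total of column 'B' (index 1) per unique 'ID' (index 2);
        # use float() instead of int() if the values are not whole numbers
        totals_per_id[row[2]] += int(row[1])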
2

Try converting the csv files into an SQL database in a one-time pre-processing step, e.g. with each file represented as a table.

Searching in a single column then reduces to an SQL query. Memory optimisation is taken care of by the database engine.

Since you use Python, I would suggest sqlite3 (import sqlite3).
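A minimal sketch of that idea in Python; the database file, table and column names below reuse the ones from the question and the answer above, and the split into an import step and a query step reflects the one-time pre-processing:

import csv
import sqlite3

# One-time pre-processing step: load the CSV into an indexed SQLite table.
conn = sqlite3.connect('csvanalysis.db')
conn.execute('CREATE TABLE IF NOT EXISTS csvdata (A, B, ID, C, D)')
with open('/PATH/file', newline='') as f:
    conn.executemany('INSERT INTO csvdata VALUES (?, ?, ?, ?, ?)', csv.reader(f))
conn.execute('CREATE INDEX IF NOT EXISTS csvdata_id ON csvdata (ID)')
conn.commit()

# Any later search or aggregation reduces to a query, e.g. the ID counts:
id_counts = dict(conn.execute('SELECT ID, COUNT(*) FROM csvdata GROUP BY ID'))
conn.close()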

Bruno Ranieri
  • This seems like the best way to do this in the long run if it drastically reduces load and runtime, so I'll give it a try if @wich's answer above is not fast enough, thanks! – user8188120 Nov 29 '18 at 13:04
  • Use this if you have *more than one query to ask*. sqlite3 still has to read the CSV data first, and build its indexes. If you only need to count the ids once per input, this is only going to be slower. – Martijn Pieters Nov 29 '18 at 13:21
0

How about:

1) df.groupby('ID').count()

or

2) df['ID'].value_counts()

See also: When is it appropriate to use df.value_counts() vs df.groupby('...').count()?

You could then build a dictionary from the result, whose index and values already hold the unique IDs and their tallied counts.
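For example, a rough sketch of option 2, assuming the field names from the question and that the whole file fits in memory:

import pandas as pd

field_names = ['A', 'B', 'ID', 'C', 'D']
df = pd.read_csv('/PATH/file', names=field_names)

# value_counts() returns a Series whose index holds the unique IDs and
# whose values hold their frequencies; to_dict() gives the desired mapping.
ID_dict = df['ID'].value_counts().to_dict()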

rich