
Which method is more efficient for comparing two large (8 GB and 5 GB) CSV files? The output should contain every id in file that is not in file1.

The data is a single column with GUIDs.

Method 1:

import pandas as pd

# Load both files fully into memory
df = pd.read_csv(file)
df1 = pd.read_csv(file1)

# Keep only the ids that appear in file but not in file1
df = df.merge(df1, on=['id'], how="outer", indicator=True).query('_merge=="left_only"')
df['id'].to_csv(output_path, index=False)

Method 2:

# Load every line of file1 into a set for fast membership tests
with open(file1, 'r') as t1:
    ids_in_file1 = set(t1)

# Stream file line by line, writing out only the ids not present in file1
with open(file, 'r') as t2, open(output_path, 'w') as outFile:
    for line in t2:
        if line not in ids_in_file1:
            outFile.write(line)
jonas

2 Answers


What do you mean by efficiency? There are certainly two major differences:

  1. The first method, which uses pandas, needs to have all the data in memory. So you will need enough available memory to hold the data from both CSV files (note: 5 + 8 GB may not be enough, depending on the type of data in the CSV files).

  2. The second method takes advantage of Python's lazy file iteration and reads the larger file line by line, loading one line into memory at a time (though it still has to hold every line of file1 in a set).

So if you have enough memory available, it will certainly be faster to load all the data into memory and perform the operations there.

If you don't have enough memory available, the second method works but is definitely slower. A good compromise might be to read the file in chunks, loading into memory only as much data as your hardware can handle, as sketched below.
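For this specific task, a minimal sketch of that compromise could look like the following. It assumes, as in your Method 1, that both CSVs have a header row and a single column named id; the chunk size of one million rows is an arbitrary value to tune to your hardware.

import pandas as pd

# file1 (the smaller file) still has to fit in memory as a set of GUIDs
ids_in_file1 = set(pd.read_csv(file1)['id'])

# Stream the big file in chunks and keep only the ids missing from file1
first_chunk = True
for chunk in pd.read_csv(file, chunksize=1_000_000):
    missing = chunk[~chunk['id'].isin(ids_in_file1)]
    missing.to_csv(output_path, mode='w' if first_chunk else 'a',
                   header=first_chunk, index=False)
    first_chunk = False

This keeps at most one chunk of the 8 GB file in memory at a time, at the cost of a full pass over the file plus a set holding all of file1's ids.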

Extras

To estimate the memory space used by your dataframe you can read this nice post:

Here you can find further reading explaining how to read a file in chunks, with or without pandas:
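As a quick complement, pandas itself can report the in-memory footprint of a dataframe, which lets you extrapolate from a small sample before committing to a full load (nrows=1_000_000 is just an illustrative sample size):

import pandas as pd

# Read a sample of the big file and measure its memory footprint
sample = pd.read_csv(file, nrows=1_000_000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
print(f"Roughly {bytes_per_row:.0f} bytes per row once loaded")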

Massifox
  • Your point is correct, but I think you might be wrong about pandas needing 13 GB... The CSV is stored as text (probably UTF-8, so one byte per character). Pandas instead will store numbers as some form of int (32 or 64 bit), so the sizes will not match (pandas will store it in less memory than the CSV). – Matteo Zanoni Sep 15 '22 at 15:41
  • Hi Matteo, I have added further considerations to my answer. – Massifox Sep 15 '22 at 15:45
  • Yes, you are right, it also depends on the type of data contained in the CSVs. However, the concept does not change: whether it is 10 GB or 32 GB is all relative to the hardware you have. The important thing is to know what to do (and how to approach the problem) when not all the data can fit in memory :) – Massifox Sep 15 '22 at 15:50
  • Definitely, you are 100% correct. Just wanted to clarify this small point; your explanation is still valid. – Matteo Zanoni Sep 15 '22 at 15:54
  • Of course, you were right to point it out to me. – Massifox Sep 15 '22 at 15:56
  • I've been running the first method for 7 hours without any results; that's why I'm considering the second method. – jonas Sep 15 '22 at 16:04
  • Ok, so you are asking too much of your hardware. It would be interesting to know whether the execution is stuck on loading into memory or on the merge. – Massifox Sep 15 '22 at 16:17

If this is something you'll have to run multiple times, you can just wrap each method with start = time.time() at the beginning and execution_time = time.time() - start at the end to compare speed. To compare memory, you can check out the memory_profiler package.
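A minimal sketch of that idea (using time.perf_counter() instead of time.time(), as discussed in the comments; method_1 and method_2 are hypothetical wrappers around the two snippets from the question):

import time

def method_1():
    ...  # the pandas merge from the question

def method_2():
    ...  # the set/streaming approach from the question

for name, method in [("pandas merge", method_1), ("set lookup", method_2)]:
    start = time.perf_counter()  # monotonic, high-resolution timer
    method()
    execution_time = time.perf_counter() - start
    print(f"{name}: {execution_time:.1f} s")

For memory, decorating each function with memory_profiler's @profile and running python -m memory_profiler script.py gives a line-by-line usage report.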

bujian
  • When using time to compute execution metrics, `time.perf_counter()` is better suited. – Matteo Zanoni Sep 15 '22 at 15:51
  • Oh interesting, I haven't seen that used much. Why is that better suited? – bujian Sep 15 '22 at 15:52
  • `perf_counter` includes time spent in sleeps, is system-wide, and has much better precision. Also, if floating-point rounding is an issue, `perf_counter_ns` may be better. Take a look at https://www.webucator.com/article/python-clocks-explained/ – Matteo Zanoni Sep 15 '22 at 15:58
  • oooh super cool, thanks matteo! – bujian Sep 15 '22 at 16:13