Comparing large files with grep or python

Question

I have two lists of URLs and I want to find the strings that are new in the second list. Example:

listA.txt
string1
string2

listB.txt
string1
string3

Then I compare both lists, to know the new string in list B:

grep -w -f listA.txt -v listB.txt

or

cat listA.txt | grep -Fxvf - listB.txt

final result:

string3

The problem is that I have millions of strings, so running the command consumes all the resources of my PC and crashes it.

Is there any way to do this with Python (something that consumes fewer resources and is faster)?

thanks

The [useless use of `cat`](http://www.iki.fi/era/unix/award.html) probably isn't the straw which breaks the camel's back here, but it is still a minor inefficiency. Your first `grep` command avoids this; the second could be rephrased as `grep -Fxvf listA.txt listB.txt` — tripleee, Aug 08 '17 at 04:53
Why do you think Python will be more efficient? The strings will take up the memory they take up unless you can tell Python something you can't tell `grep` (like for example that the strings represent something which can be stored much more compactly in memory, like numbers or hashes). — tripleee, Aug 08 '17 at 04:57
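As a rough illustration of the "store hashes instead of strings" idea from the comment above, here is a minimal Python sketch. The function names and the 8-byte digest size are assumptions made for this example, not something from the thread. Note the trade-off: two different strings can collide on the same digest, in which case a genuinely new string would be silently dropped, and each stored digest still carries Python's per-object overhead, so the saving is largest when the URLs are long.

```python
import hashlib


def digest(line, size=8):
    """Return a short fixed-size hash of a line (trailing newline ignored).

    The 8-byte default is an arbitrary choice for this sketch.
    """
    data = line.rstrip("\n").encode("utf-8")
    return hashlib.blake2b(data, digest_size=size).digest()


def new_strings_by_hash(lines_a, lines_b):
    """Yield lines of B whose hash does not occur among the hashes of A."""
    # Only the short digests of list A are kept in memory, not the strings.
    seen = {digest(line) for line in lines_a}
    for line in lines_b:
        if digest(line) not in seen:
            yield line.rstrip("\n")
```

Both arguments can be open file objects, for example `new_strings_by_hash(open('listA.txt'), open('listB.txt'))`, so list B is streamed rather than loaded.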

Alexander · Accepted Answer · 2017-08-07T19:08:06.383

0

This method creates a set from the first file (listA.txt). The only memory requirement is enough space to hold this set. It then iterates through each URL in listB.txt one line at a time (very memory efficient). If the URL is not in the set, it writes it to a new file (also very memory efficient).

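filename_1 = 'listA.txt'
filename_2 = 'listB.txt'
filename_3 = 'listC.txt'
with open(filename_1, 'r') as f1, open(filename_2, 'r') as f2, open(filename_3, 'w') as fout:
    # Iterate over f1 directly; readlines() would first build a full list of lines.
    s = set(val.strip() for val in f1)
    # Stream listB line by line and write out the lines that are missing from listA.
    for row in f2:
        row = row.strip()
        if row not in s:
            fout.write(row + '\n')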

edited Aug 07 '17 at 19:08

answered Aug 07 '17 at 19:03

Thanks. This question may interest you: https://stackoverflow.com/questions/45572807/debugging-lists-with-python – acgbox Aug 08 '17 at 15:50

score 0 · Answer 2 · answered Aug 07 '17 at 19:05

If you have sufficient memory, read the files into two lists, then convert the lists to sets, i.e. setA = set(listA). You can then use the various operators available on Python sets to do whatever operations you like, e.g. setB - setA gives the strings that appear only in list B.

I've used it before and it's very efficient.
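A minimal sketch of the set approach described above, assuming both files fit in memory. The helper names `line_set` and `new_strings` are made up for this example.

```python
def line_set(lines):
    """Build a set of lines with the trailing newline removed."""
    return {line.rstrip("\n") for line in lines}


def new_strings(lines_a, lines_b):
    """Return the strings that occur in B but not in A (set difference)."""
    return line_set(lines_b) - line_set(lines_a)
```

With the files from the question this could be called as `new_strings(open('listA.txt'), open('listB.txt'))`. The result is a set, so the original order of listB.txt is not preserved.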

score 0 · Answer 3 · answered Aug 07 '17 at 19:06

You will want to follow the solution here:

Get difference between two lists

But first, you will need to know how to load the file into a list, which is here:

How do I read a file line-by-line into a list?

Good luck. So something like this:

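with open('listA.txt') as a:
    # The with block closes the file, so no explicit close() is needed.
    listA = a.read().splitlines()
with open('listB.txt') as b:
    listB = b.read().splitlines()
# Strings that are in listB but not in listA
diff = list(set(listB) - set(listA))

# One choice for printing (Python 3 print function)
print('[%s]' % ', '.join(diff))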

tripleee · Answer 4 · 2017-08-08T05:10:21.207

If you can't fit even the smaller file into memory, Python is not going to help. The usual solution is to sort the inputs and use an algorithm which operates on just three entries at a time (it reads one entry from one file and one from the other, then based on their sort order decides which file to read from next. It needs to keep three of them in memory at any time to decide which branch to take in the code).

GNU sort will fall back to disk-based merge sort if it can't fit stuff into memory so it is basically restricted only by available temporary disk space.

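#!/bin/sh
export LC_ALL=C # use trad POSIX sort order
t=$(mktemp -t listA.XXXXXXXX) || exit 123
trap 'rm -f "$t"' EXIT HUP INT
sort listA.txt >"$t"
# -13 suppresses lines unique to listA and lines common to both, leaving only lines unique to listB
sort listB.txt | comm -13 "$t" -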

If the input files are already sorted, obviously comm is all you need.
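For readers who asked for Python, the merge step on already-sorted inputs can be sketched roughly as follows. This is an illustration of the idea rather than a replacement for `comm`: the function names are invented for this example, and the files are read as bytes so that comparisons match the byte-wise order produced by sorting with LC_ALL=C. Unlike `comm`, repeated lines are treated as a plain set difference.

```python
def lines_only_in_b(sorted_a, sorted_b):
    """Yield entries of sorted_b that do not occur in sorted_a.

    Both arguments are iterables in ascending order. Only the current
    entry from each input is held in memory at any time.
    """
    iter_a = iter(sorted_a)
    a = next(iter_a, None)
    for b in sorted_b:
        # Advance A while its current entry sorts before b.
        while a is not None and a < b:
            a = next(iter_a, None)
        # A is exhausted or has moved past b without matching it: b is new.
        if a is None or a != b:
            yield b


def stripped_lines(f):
    """Yield lines from a binary file object without the trailing newline."""
    for line in f:
        yield line.rstrip(b"\n")


def write_new_lines(path_a, path_b, path_out):
    """Write lines found only in the sorted file B to path_out."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb, open(path_out, "wb") as out:
        for line in lines_only_in_b(stripped_lines(fa), stripped_lines(fb)):
            out.write(line + b"\n")
```

The inputs must already be sorted in the same byte-wise order; the sorting itself is still best left to an external tool such as GNU sort when the data does not fit in memory.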

Bash (and I guess probably also Zsh and ksh) offers process substitution like comm <(sort listA.txt) <(sort listB.txt) but I'm not sure if that's robust under memory exhaustion.

As I'm sure you have already discovered, if the files are radically different size, it makes sense to keep the smaller one in memory regardless of your approach (so switch the order of listA.txt and listB.txt if listB.txt is the smaller one, here and in your original grep command line; though I guess it will make less of a difference here).
