
I have two large files (sets of filenames), roughly 30,000 lines in each. I am trying to find a fast way of finding lines in file2 that are not present in file1.

For example, if this is file1:

A=1
B=2
C=3

And this is file2:

A=10
B=20
C=30
D=5

Then my result/output should be:

D=5

Since there is no D=something in file1.

  • The first step is certainly to sort both of them. – Hack5 Jun 15 '21 at 09:18
  • You should decide whether this is a python or bash question. It can't be both. – oguz ismail Jun 15 '21 at 09:51
  • Does this answer your question? [Fast way of finding lines in one file that are not in another?](https://stackoverflow.com/questions/18204904/fast-way-of-finding-lines-in-one-file-that-are-not-in-another) – mihi Jun 15 '21 at 10:35
  • This has been marked as a duplicate, but it isn't a duplicate. The linked question is about matching whole lines; this question is about matching partial lines. Similar, but not a duplicate, and the answers to the other question need adapting to work here. – Chris Lear Jun 15 '21 at 11:24
  • `awk -F= 'NR==FNR{a[$1]=1; next} !a[$1]' file1 file2 file3...` This will print the lines whose keys are not in file1. You can supply any number of files after file1. – BarathVutukuri Jun 15 '21 at 11:24
  • @Chris Lear Ah sorry :/. Based my comment on the question text. Don't have the privileges to cast a reopen vote though. – mihi Jun 15 '21 at 11:40

3 Answers


You could read file1 into a list, then read file2 and, for each entry, check whether its key is in that list.

filelist1 = []
with open(file1, 'r') as f:
    for line in f:
        # collect the key (the part before '=') from each line of file1
        filelist1.append(line.split('=')[0])

with open(file2, 'r') as f:
    for line in f:
        # report lines of file2 whose key never appeared in file1
        if line.split('=')[0] not in filelist1:
            print(line, end='')  # the line already ends with '\n'

Should do the trick.

jacobscgc

This works for the example given, and is quite simple:

grep -v -F "$(grep -o '.*=' file1)" file2

I tried it on an artificially-created 30000 line file, and it was fast.

It just uses grep -o to extract the key parts (everything up to and including '=') from file1, which are then fed to grep -F as fixed strings. Then -v inverts the match, i.e. 'show lines of file2 that don't match any of those strings'.

Some caveats:

  • this is case sensitive, so A=10 is different from a=10.
  • there's an assumption that there is exactly one '=' sign on any line that's significant in file1, and that everything to the left of it (including spaces) is part of the check.
  • There's probably a bug: since -F matches substrings, if file1 contains A=10 and file2 contains AAA=10, the pattern A= will match inside AAA=10, so that line won't be reported. I'll try to rewrite the one-liner to fix this bug.
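One possible fix, sketched under the assumption that the keys contain no regex metacharacters: turn each key into a ^-anchored pattern, so a key only matches at the start of a line. This means dropping -F, since ^ is a regex anchor:

```shell
# Sample data reproducing the substring bug: file1 has key A,
# file2 has keys A, AAA and D.
printf 'A=1\nAAA=10\n' > file1
printf 'A=10\nAAA=10\nD=5\n' > file2

# Turn "A=1" into the anchored pattern "^A=", then filter file2.
sed 's/=.*/=/; s/^/^/' file1 > keys.txt
grep -v -f keys.txt file2   # prints: D=5
```

With the anchoring, ^A= no longer matches AAA=10, so only D=5 (the key absent from file1) is reported.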

Another option, which is simpler and actually nicer:

join -t= -v 2 <(sort file1) <(sort file2)

This one requires file1 and file2 to be sorted first, but doesn't have the bug described above for the grep version. It's also probably faster (I haven't measured). The other caveats above still apply.
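A worked run of the join approach on the question's sample data (the process substitution `<(...)` assumes bash):

```shell
# The question's sample files.
printf 'A=1\nB=2\nC=3\n' > file1
printf 'A=10\nB=20\nC=30\nD=5\n' > file2

# -t= splits on '=', -v 2 prints lines of the second file whose
# key has no match in the first; join requires sorted input.
join -t= -v 2 <(sort file1) <(sort file2)   # prints: D=5
```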

Chris Lear

Something like this would work, but it would only give you "D", not the complete line; it depends on what you need downstream and what exactly your files look like. Also, as mentioned by Aaron, it should be quite fast.

with open("path/to/file1") as f1, open("path/to/file2") as f2:
    s1 = set(x.split("=")[0] for x in f1)  # keys present in file1
    s2 = set(x.split("=")[0] for x in f2)  # keys present in file2
    result = s2 - s1                       # keys in file2 but not in file1

If you need the full line:

with open("path/to/file1") as f1, open("path/to/file2") as f2:
    s1 = set(x.split("=")[0] for x in f1)
    result = [x for x in f2 if x.split("=")[0] not in s1]

fsimonjetz