
I have two large files (sets of filenames), roughly 30,000 lines in each. I am trying to find a fast way of finding lines in file2 that are not present in file1.

For example, if this is file1:

A=1
B=2
C=3

And this is file2:

A=10
B=20
C=30
D=5

Then my result/output should be:

D=5

Since there is no D=something in file1.

  • The first step is certainly to sort both of them. – Hack5 Jun 15 '21 at 09:18
  • You should decide whether this is a python or bash question. It can't be both. – oguz ismail Jun 15 '21 at 09:51
  • Does this answer your question? [Fast way of finding lines in one file that are not in another?](https://stackoverflow.com/questions/18204904/fast-way-of-finding-lines-in-one-file-that-are-not-in-another) – mihi Jun 15 '21 at 10:35
  • This has been marked as a duplicate, but it isn't a duplicate. The linked question is about matching whole lines; this question is about matching partial lines. Similar, but not a duplicate, and the answers to the other question need adapting to work here. – Chris Lear Jun 15 '21 at 11:24
  • `awk -F= 'NR==FNR{a[$1]=1; next} !a[$1]' file1 file2 file3...` This will print the lines whose keys are not in file1. You can supply any number of files after file1. – BarathVutukuri Jun 15 '21 at 11:24
  • @Chris Lear Ah sorry :/. Based my comment on the question text. Don't have the privileges to cast a reopen vote though. – mihi Jun 15 '21 at 11:40

3 Answers


You could read file1 into a list, then read file2 and, for each entry, check whether its key is in that list.

filelist1 = []
with open(file1, 'r') as f:
    for line in f:
        # collect the key (the part before '=') from each line of file1
        filelist1.append(line.split('=')[0])

with open(file2, 'r') as f:
    for line in f:
        # report lines of file2 whose key never appeared in file1
        if line.split('=')[0] not in filelist1:
            print(line, end='')  # the line already ends with '\n'

Should do the trick.

jacobscgc

This works for the example given, and is quite simple:

grep -v -F "$(grep -o '.*=' file1)" file2

I tried it on an artificially-created 30000 line file, and it was fast.

It just uses grep -o to extract the key parts (everything up to and including '=') from file1, which are then fed to grep -F as fixed strings. Then -v inverts the match, i.e. 'show lines of file2 that don't match any of those strings'.

Some caveats:

  • this is case sensitive, so A=10 is different from a=10.
  • there's an assumption that there is exactly one '=' sign on any line that's significant in file1, and that everything to the left of it (including spaces) is part of the check.
  • There's probably a bug: since -F matches substrings, if file1 contains A=10 and file2 contains AAA=10, the pattern A= will match inside AAA=10, so that line won't be reported. I'll try to rewrite the one-liner to fix this bug.
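One possible fix, sketched under the assumption that the keys contain no regex metacharacters: turn each key into a ^-anchored pattern, so a key only matches at the start of a line. This means dropping -F, since ^ is a regex anchor:

```shell
# Sample data reproducing the substring bug: file1 has key A,
# file2 has keys A, AAA and D.
printf 'A=1\nAAA=10\n' > file1
printf 'A=10\nAAA=10\nD=5\n' > file2

# Turn "A=1" into the anchored pattern "^A=", then filter file2.
sed 's/=.*/=/; s/^/^/' file1 > keys.txt
grep -v -f keys.txt file2   # prints: D=5
```

With the anchoring, ^A= no longer matches AAA=10, so only D=5 (the key absent from file1) is reported.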

Another option, which is simpler and actually nicer:

join -t= -v 2 <(sort file1) <(sort file2)

This one requires file1 and file2 to be sorted first, but doesn't have the bug described above for the grep version. It's also probably faster (I haven't measured). The other caveats above still apply.
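A worked run of the join approach on the question's sample data (the process substitution `<(...)` assumes bash):

```shell
# The question's sample files.
printf 'A=1\nB=2\nC=3\n' > file1
printf 'A=10\nB=20\nC=30\nD=5\n' > file2

# -t= splits on '=', -v 2 prints lines of the second file whose
# key has no match in the first; join requires sorted input.
join -t= -v 2 <(sort file1) <(sort file2)   # prints: D=5
```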

Chris Lear

Something like this would work, but it would only give you "D", not the complete line; it depends on what you need downstream and what exactly your files look like. Also, as mentioned by Aaron, it should be quite fast.

with open("path/to/file1") as f1, open("path/to/file2") as f2:
    s1 = set(x.split("=")[0] for x in f1)  # keys present in file1
    s2 = set(x.split("=")[0] for x in f2)  # keys present in file2
    result = s2 - s1                       # keys in file2 but not in file1

If you need the full line:

with open("path/to/file1") as f1, open("path/to/file2") as f2:
    s1 = set(x.split("=")[0] for x in f1)
    result = [x for x in f2 if x.split("=")[0] not in s1]

fsimonjetz