
I am trying to find an efficient way to compare the content of several text files and find the lines that are duplicated across them.

I started with a nested loop first, and it worked.

import os

def process_files(self, directory):
    files = os.listdir(directory)
    files = [os.path.join(directory, file) for file in files]
    for i in range(len(files)):
        file1 = files[i]
        with open(file1, 'r') as fh1:
            # Split the first file into words once per outer iteration.
            file1_words = fh1.read().split()
        # Start the inner loop at i + 1 so each pair is compared only once
        # and a file is never compared against itself.
        for j in range(i + 1, len(files)):
            file2 = files[j]
            with open(file2, 'r') as fh2:
                file2_words = fh2.read().split()
            for w in file2_words:
                if w in file1_words:
                    print(w)

Then I found it very slow, as the files are large, so I tried to use pool workers to find a way around that. I tried to implement the idea mentioned here. However, I can't get it to work properly.

I have one requirement: I don't want to compare a file against itself, which should be taken into account when zipping the file pairs.
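
Here is roughly the direction I am going in (only a sketch of what I am after, not my actual attempt; the worker function name and the use of `itertools.combinations` instead of `zip` to build the pairs are placeholders):

import os
from itertools import combinations
from multiprocessing import Pool

def common_words(pair):
    # Worker: return the pair together with the words the two files share.
    file1, file2 = pair
    with open(file1, 'r') as fh1:
        words1 = set(fh1.read().split())
    with open(file2, 'r') as fh2:
        words2 = set(fh2.read().split())
    return pair, words1 & words2

def process_files(directory):
    files = [os.path.join(directory, f) for f in os.listdir(directory)]
    # combinations() yields each unordered pair of distinct files exactly once,
    # so a file is never paired with itself.
    pairs = list(combinations(files, 2))
    with Pool() as pool:
        for pair, shared in pool.imap_unordered(common_words, pairs):
            for w in shared:
                print(pair, w)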

If someone can give me some ideas on this, it would be much appreciated. Thanks.

Likak
  • So you want to print all the lines that are in at least two files? – Willem Van Onsem Jan 05 '18 at 23:07
  • If your files are so large, why not use bash `sort` or `grep` commands – taoufik A Jan 05 '18 at 23:12
  • @WillemVanOnsem, At first yes, and eventually I want to remove them from the secondary file. – Likak Jan 05 '18 at 23:13
  • @taoufikA Could you elaborate more on that. – Likak Jan 05 '18 at 23:15
  • @Likak: what is "*large*"? Are we talking about ~10kiB? ~1MiB? ~1GiB? – Willem Van Onsem Jan 05 '18 at 23:15
  • @WillemVanOnsem, between 1GiB and 3GiB – Likak Jan 05 '18 at 23:16
  • Even if Python's multiprocessing or multithreading were efficient, they wouldn't help here. Your algorithm is not trivial to parallelise and it is inefficient (it scales badly), making a constant performance boost from parallel execution irrelevant. Can you afford storing these files in RAM? – Eli Korvigo Jan 05 '18 at 23:16
  • I'm already surprised that this does not run out of memory, since you store a list of words. – Willem Van Onsem Jan 05 '18 at 23:20
  • Your question says "lines" but your code says "words". Which is it? – Mark Ransom Jan 05 '18 at 23:21
  • @WillemVanOnsem perhaps the OP has tons of RAM, or the memory is thrashing heavily, which would explain why he/she never gets to the exception. – Eli Korvigo Jan 05 '18 at 23:23
  • If you make `file1_words` a `set` instead of a `list` it will speed up greatly, at the cost of more memory used. You should also move the creation outside of the inner loop, since it doesn't change for each new `file2`. – Mark Ransom Jan 05 '18 at 23:24 (see the sketch below the comments)
  • @MarkRansom I would not recommend using built-in hash-tables for that, because they are notoriously memory inefficient. – Eli Korvigo Jan 05 '18 at 23:25
  • @EliKorvigo reading an entire file into memory is already memory inefficient, but that doesn't appear to bother the O.P. The strings themselves will be equal in size, it's only the difference between `list` and `set` overhead that will matter; do you know what a typical ratio would be? – Mark Ransom Jan 05 '18 at 23:27
  • @MarkRansom https://stackoverflow.com/a/31153174/3846213 – Eli Korvigo Jan 05 '18 at 23:29
  • @EliKorvigo that was for a `dict` not a `set`, and the elements were small (integer). I'm not sure if that generalizes to this case. – Mark Ransom Jan 05 '18 at 23:33
  • @MarkRansom, about your question concerning word or line. You are right. My bad. The lines have two words separated by space; first: MD5 of the filename, second: filename. – Likak Jan 05 '18 at 23:41
  • @EliKorvigo, I need to find out if I can store them on RAM. Thanks for your idea. – Likak Jan 05 '18 at 23:42
  • @Likak you would still need to modify your algorithm, because you already store the data in RAM, that is RAM doesn't solve the issue, it just lets you consider more efficient ways to deal with the issue. – Eli Korvigo Jan 05 '18 at 23:44
  • @MarkRansom since OP actually stores words (short strings, which are comparable to integers in terms of memory requirements), the overhead would kill his/her RAM. `from sys import getsizeof;from random import choice;from string import ascii_letters;nwords = 10000;world_length = 6;words = [''.join(choice(ascii_letters) for _ in range(world_length)) for _ in range(nwords)];getsizeof(words);getsizeof(set(words))` – I get 87624 bytes for the list and 524512 bytes for the set (almost 10x as many). – Eli Korvigo Jan 05 '18 at 23:46
  • By duplicate lines do you mean an exact line-to-line match? So you have files which contain, let's say, strings per line and you want to find duplicate strings? – Chenna V Jan 05 '18 at 23:58
  • @EliKorvigo: The overhead is higher because the goal of `set` is fast membership testing, and they're willing to pay memory for it. The actual multiplier vs. `list` is between 3x and 12x in testing, because `set`s only increase in size when the number of entries quadruples (`list`s resize more granularly). You could get lower memory usage (much lower on 3.6) in exchange for slightly slower lookups by using `dict.fromkeys` to make a `dict` where the values are ignored. `getsizeof(dict.fromkeys(words))` is 295008 on 3.6, and 393312 on 3.5. If you have the memory, `O(1)` lookups are worth it. – ShadowRanger Jan 06 '18 at 00:19
  • @ShadowRanger I know what a hash-table is and when/why one would want to use it :) I'm only saying, that Python's hash-tables are not exactly memory-friendly enough to store all words in two large files, though there are highly efficient implementations in other languages (e.g. sparse hash map in C++). – Eli Korvigo Jan 06 '18 at 01:52
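
A minimal sketch of the `set`-based membership test suggested in the comments above (the function name and the choice to stream the second file line by line are assumptions, not code from the post):

def print_shared_words(file1, file2):
    # Build the lookup set for the first file once, outside any inner loop.
    with open(file1, 'r') as fh1:
        file1_words = set(fh1.read().split())
    # Stream the second file so only the first file's words have to stay in memory.
    with open(file2, 'r') as fh2:
        for line in fh2:
            for w in line.split():
                if w in file1_words:  # average O(1) lookup instead of scanning a list
                    print(w)

If memory is the bottleneck, `dict.fromkeys(file1_words)` can be used instead of a `set`, as pointed out above, trading slightly slower lookups for a smaller footprint.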

2 Answers


Here are some solutions using the bash `comm`, `sort` and `awk` commands that redirect the lines unique to file2 into out.

comm -13 <(sort f1) <(sort f2) > out

If you want more speed, use the `--parallel` option of `sort` (e.g. `sort --parallel=4 f1`).

Using `awk` (`NR==FNR` is only true while reading the first file, so its lines are stored as keys of the `lines` array; lines of `f2` that are not in the array are then printed):

awk 'NR==FNR{lines[$0];next} !($0 in lines)' f1  f2 > out

File 1:

I'm unique and I live in file 1
1
2
3
4
5
I'm unique and I live in file 1
6
I'm unique and I live in file 1
7
I'm unique and I live in file 1

File 2:

1
2
I'm unique and I live in file 2
3
4
5
I'm unique and I live in file 2
6
7

Out:

I'm unique and I live in file 2
I'm unique and I live in file 2
taoufik A

By duplicate lines, if you mean an exact line-to-line match, then consider a database into which you insert each entire line as a string and check whether the line already exists. You can try databases like MongoDB or Redis.
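
A rough sketch of that idea with Redis (assuming the `redis-py` client and a Redis server running locally; the function name and the key name `seen_lines` are only for illustration):

import redis

def print_duplicate_lines(file1, file2):
    r = redis.Redis()
    with open(file1, 'r') as fh1:
        for line in fh1:
            # SADD stores every line of the first file in a Redis set.
            r.sadd('seen_lines', line.rstrip('\n'))
    with open(file2, 'r') as fh2:
        for line in fh2:
            # SISMEMBER checks for an exact line match without holding file1 in Python memory.
            if r.sismember('seen_lines', line.rstrip('\n')):
                print(line, end='')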

Chenna V