Deleting duplicate lines in python

Question

I have data of the following form:

#@ <abc>
 <http://stackoverflow.com/questions/ask> <question> _:question1 .
#@ <def>
<The> <second> <http://line> .
#@ <ghi>
 _:question1 <http#responseCode> "200"^^<http://integer> .
#@ <klm>
<The> <second> <http://line1.xml> .
#@ <nop>
 _:question1 <date> "Mon, 23 Apr 2012 13:49:27 GMT" .
#@ <jkn>
<G> <http#fifth> "200"^^<http://integer> .
#@ <k93>
 _:question1 <http#responseCode> "200"^^<http://integer> .
#@ <k22>
<This> <third> <http://line2.xml> .
#@ <k73>
 <http://site1> <hasAddress> <http://addr1> .
#@ <i27>
<kd8> <fourth> <http://addr2.xml> .

Now whenever two lines are equal, like: _:question1 <http#responseCode> "200"^^<http://integer> ., then I want to delete the equal lines (lines which match with each other character by character are equal lines) along with (i). the subsequent line (which ends with a fullstop) (ii). line previous to the equal line (which begins with #@).

#@ <abc>
 <http://stackoverflow.com/questions/ask> <question> _:question1 .
#@ <def>
<The> <second> <http://line> .
#@ <nop>
 _:question1 <date> "Mon, 23 Apr 2012 13:49:27 GMT" .
#@ <jkn>
<G> <http#fifth> "200"^^<http://integer> .
#@ <k73>
 <http://site1> <hasAddress> <http://addr1> .
#@ <i27>
<kd8> <fourth> <http://addr2.xml> .

Now one way to do this is to store all these lines in a set in python and whenever two lines are equal (i.e. they match character by character) the previous and subsequent two lines are deleted. However, the size of my dataset is 100GB (and I have RAM of size 64GB), therefore I can not keep this information in set form in main-memory. Is there some way by which I can delete the duplicate lines along with their previous and subsequent two lines in python with limited main-memory space (RAM size 64 GB)

You don't need to store all the lines, you can remove dupes in sections, it will mean reopening files but if you cannot store all the lines at once you don't have many options — Padraic Cunningham, Jul 06 '15 at 22:09

fferri · Answer 1 · 2015-07-06T22:41:31.873

Keep a boolean hashtable of hash codes of lines already seen.

For each line:

if line hash()es to something you have already seen, you have a potential match: scan the file to check if it really is a duplicate.
if line hash()es to a new hash, just mark the hash for the first time.

Dedicate as much memory you can to this hashtable, and the false positive rate will be low (i.e. less times you will have to scan for duplicates and found none).

Example:

table_size = 2**16
seen = [False]*table_size
for line in file:
    h = hash(line) % table_size
    if seen[h]:
        dup = False
        with open('yourfile','r') as f:
            for line1 in f:
                if line == line1:
                    dup = True
                    break
            if not dup:
                print(line)
    else:
        seen[h] = True
        print(line)

As it has been pointed out, since you cannot store all the lines in memory you don't have many options, but at least this option doesn't require to scan the file for every single line, because most of the entries in the table will be False, i.e. the algorithm is sub-quadratic if the tabe is not full; it will degenerate to O(n²) once the table is full.

You can make a very memory-efficient implementation of the hash table, that requires only 1 bit for each hash code (e.g. make it an array of bytes, where each byte can store 8 boolean values)

See also Bloom Filters for more advanced techniques.

Alternatively, one might represent each line as a set of k-mers and use rolling hash to quickly hash the entire line in O(n). Then you can use MinHash to approximate set comparison. To do this you need to come up with a set of N prime numbers and `xor` the initial hash with each of them. Then you are to pick the smallest value. This might reduce the rate of false-positive matches, hence reduce the need in character-wise line comparison. — Eli Korvigo, Jul 06 '15 at 22:44
And using `array.array` instead of `list` might help reduce the memory impact. — Eli Korvigo, Jul 06 '15 at 23:06

pvg · Answer 2 · 2015-07-06T22:56:39.873

One fairly straightforward way - make a version of your data such that each line includes a field with its line number. Use unix 'sort' to sort that new file, excluding the line number field. The sort utility will merge sort the file even if it exceeds the size of available memory. Now you have a new file in which the duplicates are ordered, along with their original line numbers. Extract the line numbers of the duplicates and then use that as input for linearly processing your original data.

In more detailed steps.

Make a new version of your file such that each line is prepended by its line number. So, "someline" becomes "1, someline"
sort this file using the unix sort utility - sort -t"," -k2,2 file
Scan the new file for consecutive duplicate entries in the second field
the line numbers (first field) of such entries are the line numbers of duplicate lines in your original file - extract these and use them as input to remove duplicates in the original data. Since you know exactly where they are, you need not read in the entire file or create a giant in-memory structure for duplicates

The advantage of this method compared to some of the others suggested - it always works, regardless of the size of the input and the size of your available memory and it does not fail due to hash collisions or other probabilistic artifacts. You are leveraging the merge sort in unix sort where the hard stuff - dealing with larger-than-memory input - has been done for you.

This could be tricky because 1) each data item is two lines long and 2) the order of the lines might be important. However if you can use UNIX `sort` then you should probably also use UNIX `uniq` to get the unique lines. — dshepherd, Jul 06 '15 at 22:45
I'm assuming the order matters, also uniq can't really deal with the questions data processing requirements which are beyond merely finding the line but also munging quite a bit of the surrounding output. You can't really do that with sort/uniq alone — pvg, Jul 06 '15 at 22:58

score 1 · Answer 3 · edited May 23 '17 at 11:51

Here's an outline of how I'd do it using UNIX sort/uniq:

Modify the data format so that each record is a single line. You could do this using the methods here.
Sort the data with the sort command. Note the you can specify which fields are important with the --key option, you might need to exclude the #@ <abc> part by selecting all the other fields as keys (I wasn't entirely sure from your description).
Apply the uniq command to the sorted output to get only the unique lines.

This should all work fine on out-of-core data as far as I know.

Deleting duplicate lines in python

3 Answers3