
I have a very large text file with duplicate entries which I want to eliminate. I do not care about the order of the entries because the file will later be sorted.

Here is what I have so far:

unique_lines = set()
outfile = open("UniqueMasterList.txt", "w", encoding = "latin-1")

with open("MasterList.txt", "r", encoding = "latin-1") as infile:
    for line in infile:
        if line not in unique_lines:
            outfile.write(line)
            unique_lines.add(line)

outfile.close()

It has been running for 30 minutes and has not finished. I need it to be faster. What is a faster approach in Python?

  • I would just use `sort -u -o UniqueMasterList.txt MasterList.txt` rather than write any custom code. – chepner Dec 20 '16 at 20:14
  • Why don't you sort it first? It will be easy to remove the duplicates then. – martianwars Dec 20 '16 at 20:15
  • Not related to your issue, but you don't need to close `infile` since you opened it using the `with` keyword. – Will Dec 20 '16 at 20:15
  • martianwars, why would it be easier? –  Dec 20 '16 at 20:18
  • @Kos Any duplicate lines will be immediately adjacent to each other, meaning you don't have to keep the entire unique set in memory. – chepner Dec 20 '16 at 20:19
  • When you say "very large file", how large? Any chance it will fit in memory? I presume so, since the `set` will grow too large if not. If you're running a 64-bit version of Python it may be swapping, which would definitely make it too slow. – Mark Ransom Dec 20 '16 at 20:24
  • It's 5 GB, and I have 16 GB of RAM. I'm running 64-bit. What do you mean by swapping? –  Dec 20 '16 at 20:25
  • Possible duplicate of [How do you remove duplicates from a list in Python whilst preserving order?](http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-whilst-preserving-order) –  Dec 20 '16 at 20:29

4 Answers


Look for the corresponding system command. In Linux/UNIX, you would use

uniq MasterList.txt > UniqueMasterList.txt

The OS generally knows the best way to do these things.


post-comment edit

@Mark Ransom reminded me that uniq depends on matching lines being contiguous in the file. The simplest way to achieve this is to sort the file:

sort MasterList.txt | uniq > UniqueMasterList.txt
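
If you would rather drive this from Python than from the shell (as also suggested in the comments), one option is the standard-library `subprocess` module. A minimal sketch, reusing the file names from the question and the `sort -u` form mentioned in the comments:

import subprocess

# `sort -u` sorts the file and drops duplicate lines in one pass,
# writing the result to UniqueMasterList.txt via -o.
subprocess.run(
    ["sort", "-u", "-o", "UniqueMasterList.txt", "MasterList.txt"],
    check=True,  # raise if the external sort fails
)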
Prune
  • Why is it faster? Also (I edited the title to make it clearer), is there a faster way to do this in Python itself? –  Dec 20 '16 at 20:23
  • It's faster because the OS can optimize for its file structure, storage, and access methods. You can do it faster in Python by calling the system command from your Python script. A second choice might be to hash the lines to speed up the **in** operator. For a test on this, insert them into a set instead of a list and see what performance change you get. – Prune Dec 20 '16 at 20:28
  • @Prune the original question already uses a `set`. Using a cryptographic hash in place of the original line string might reduce the data load, but it's an open question whether it would be faster. – Mark Ransom Dec 20 '16 at 20:36
  • P.S. `uniq` only works if the data is *already* sorted, you must pair it with `sort` otherwise. If you don't, it will be *very* fast but incorrect. – Mark Ransom Dec 20 '16 at 20:36
  • The `sort` command can also be used to remove duplicates directly using the flag `u`. If you are running the code on a multicore system it can also leverage multiple cores, e.g. `sort --parallel=4 -u file-with-duplicates.csv > file-no-duplicates.tsv`. – Robin Keskisarkka Apr 19 '18 at 07:32
  • On Windows 10, if you have WSL, you can use `wsl uniq` – VingtCent Jul 16 '22 at 18:35

To use the same technique as `uniq`, in Python:

import itertools

with open("MasterList.txt", "r", encoding="latin-1") as infile:
    sorted_lines = sorted(infile)  # read every line and sort them in memory

with open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
    # groupby collapses runs of identical adjacent lines, just like uniq
    for line, _ in itertools.groupby(sorted_lines):
        outfile.write(line)

This presumes that the entire file will fit into memory, twice over. Or that the file is already sorted, in which case you can skip that step.
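
If the input really has been sorted beforehand (for instance by the external `sort` command), a streaming variant of the same `groupby` idea needs almost no memory. This is only a sketch; the sorted-input name `MasterList.sorted.txt` is hypothetical, and the output name comes from the question:

import itertools

# Assumes the input was sorted beforehand, e.g. with `sort`; the input
# file name here is only illustrative.
with open("MasterList.sorted.txt", "r", encoding="latin-1") as infile, \
     open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
    # groupby consumes the file lazily, dropping consecutive duplicates
    # without ever loading the whole file into memory.
    for line, _ in itertools.groupby(infile):
        outfile.write(line)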

Mark Ransom

The simple approach I would suggest is hashing with a hash table. Hash each line with an efficient hash function, insert the hash into a hash table, and write a line out only the first time it is seen (i.e. when its count in the table is 1), much like solving a word/letter-count problem with a hash table. Lookups cost O(1) on average, and memory use can be kept down because it depends on the size of the hash table rather than on the full text of every line.
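
A minimal sketch of that idea, assuming an MD5 digest from the standard-library `hashlib` is acceptable and that one 16-byte digest per unique line fits in memory; the file names are taken from the question:

import hashlib

seen = set()  # one fixed-size digest per unique line seen so far

with open("MasterList.txt", "r", encoding="latin-1") as infile, \
     open("UniqueMasterList.txt", "w", encoding="latin-1") as outfile:
    for line in infile:
        # Hash the line so only a 16-byte digest is stored, not the full text.
        digest = hashlib.md5(line.encode("latin-1")).digest()
        if digest not in seen:  # O(1) average-case membership test
            seen.add(digest)
            outfile.write(line)  # first occurrence: keep the line

Note that with a digest-only set, a hash collision would silently drop a distinct line, so this trades a vanishingly small risk of data loss for lower memory use.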

mohor chatt
# Scatter the input into SPLIT_COUNT bucket files by line hash, then remove
# duplicates within each bucket: identical lines always hash to the same
# bucket, so each bucket can be deduplicated independently and only one
# bucket's unique lines must be held in memory at a time.
import os

SPLIT_COUNT = 30


def write_data(t_file, value):
    t_file.write(value)


def calculate_hash(filename, handle_file):
    # Route each line to the bucket file chosen by hash(line) % SPLIT_COUNT.
    with open(filename, 'r') as f:
        for line in f:
            write_data(handle_file[hash(line) % SPLIT_COUNT], line)


def generate_file(dir):
    # Create the bucket files and return their paths and open handles.
    os.makedirs(dir, exist_ok=True)
    handle_file, files = [], []
    for i in range(SPLIT_COUNT):
        path = dir + "split_" + str(i)
        files.append(path)
        handle_file.append(open(path, 'w'))
    return files, handle_file


def close_file(handle_file):
    for f in handle_file:
        f.close()


def data_uniq(files, new_file):
    # Deduplicate each bucket and append its unique lines to the output file.
    n_file = open(new_file, 'w')
    for filename in files:
        dataset = dict()
        with open(filename, 'r') as f:
            for line in f:
                dataset[line] = 1
        for key in dataset.keys():
            n_file.write(key)
    n_file.close()


if __name__ == "__main__":
    filename = './clean.txt'
    generate_dir = './tmp/'
    new_file = './out.txt'
    files, handle_file = generate_file(generate_dir)
    calculate_hash(filename, handle_file)
    close_file(handle_file)
    data_uniq(files, new_file)
TanLingxiao