
I have a large CSV file (1.8 GB) with three columns. Each row contains two strings and a numerical value. The problem is that some rows are duplicates with the first two columns swapped. Example:

Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123

The desired output would look like this:

Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454

because the third row contains the same information as the first row.

EDIT

The data basically consists of strings in the first two columns and a numerical value in the third, with about 40 million lines.

BeBo

4 Answers


Can you handle awk:

$ awk -F, '++seen[$3]==1' file

Output:

Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454

Explained:

$ awk -F, '      # set comma as field delimiter
++seen[$3]==1    # count occurrences of the third field in a hash; print a record only the first time its value is seen
' file

Update:

$ awk -F, '++seen[($1<$2?$1 FS $2:$2 FS $1)]==1' file

Output:

Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454

It hashes every combination of the first and second fields it meets, so that "ABC,DEF" == "DEF,ABC", counts them, and prints only the first occurrence. ($1<$2?$1 FS $2:$2 FS $1): if the first field sorts before the second, hash "1st,2nd", otherwise hash "2nd,1st".
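
For readers who end up needing the same logic in Python (the question was originally tagged python), here is a minimal sketch of the same order-insensitive key idea; the in.csv/out.csv file names are placeholders:

seen = set()
with open("in.csv", "r") as src, open("out.csv", "w") as dst:
    for line in src:
        fields = line.rstrip("\n").split(",")
        # Sorting the first two fields builds the same key for "ABC,DEF,..." and "DEF,ABC,...",
        # mirroring awk's ($1<$2 ? $1 FS $2 : $2 FS $1).
        key = ",".join(sorted(fields[:2]))
        if key not in seen:
            seen.add(key)
            dst.write(line)

Like the awk one-liner, this keeps the first occurrence of every pair and passes the header through untouched, since "Col1,Col2" is only seen once.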

James Brown
  • For some reason this only filters the first 6000 entries and deletes or discards the rest. – BeBo Apr 06 '19 at 18:05
  • Interesting. Which awk do you have? – James Brown Apr 06 '19 at 18:07
  • If I'm not entirely mistaken it's mawk 1.3.3 Nov 1996. – BeBo Apr 06 '19 at 18:09
  • Another point is: this only filters out entries with a specific value. Meaning: it filters entries that start with ABC, but everything else gets deleted even though it is unique. – BeBo Apr 06 '19 at 18:18
  • No, it prints only those records which have a unique value in the third field, ie. values which it sees for the first time. For example, it won't print `DEF,ABC,123` as it sees the `123` in the third field for the second time. – James Brown Apr 06 '19 at 18:25
  • Ahh, that makes sense. How do I get a unique combination of Field 1 and Field 2? – BeBo Apr 06 '19 at 18:27
  • Okay, this seems to work fine on my example. It's running with the actual data. Thanks a lot! Upvote does not work yet, but this answer worked for me! – BeBo Apr 06 '19 at 18:48
  • `++a[key]==1` is more commonly/idiomatically implemented as `!seen[key]++`. – Ed Morton Apr 06 '19 at 19:43
  • Will this work if there is a row, say `DEF,GHI,123`, as the fourth line? It won't print it when it should. – sjsam Apr 07 '19 at 07:26
  • @JamesBrown Your updated version does handle it. :) – sjsam Apr 07 '19 at 07:51
  • @JamesBrown: I do regret that your `awk` solution, while idiomatic, looks a bit unreadable. But I still consider it pretty smart :) – sjsam Apr 07 '19 at 07:56
  • WAAT? Even after I lost my integrity and replaced `a[]` with `seen[]` and everything. – James Brown Apr 07 '19 at 08:32

Note: This answer was written before the OP changed the question's tags (the python tag was temporarily removed; see the comments below).

If you don't mind the order of the elements you might do:

with open("in.csv", "r") as file:
    lines = set()
    for line in file:
        # A frozenset ignores field order, so "ABC,DEF,123" and "DEF,ABC,123"
        # collapse to the same element.
        lines.add(frozenset(line.strip("\n").split(",")))

with open("out.csv", "w") as file:
    for line in lines:
        file.write(",".join(line)+"\n")

Output:

Col2,Col1,Col3
EFG,454,ABC
DEF,123,ABC

Note that you might want to treat the first line (the header) in a special way so as not to lose its order.
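
A sketch of that special treatment: read the header once before deduplicating, then write it back first (same in.csv/out.csv names as above):

with open("in.csv", "r") as file:
    header = next(file)  # keep the header line exactly as it appears in the input
    lines = set()
    for line in file:
        lines.add(frozenset(line.strip("\n").split(",")))

with open("out.csv", "w") as file:
    file.write(header)  # the header keeps its original column order
    for line in lines:
        file.write(",".join(line)+"\n")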

But if the order matters, you could use the code from Maintaining the order of the elements in a frozen set:

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element        

with open("in.csv", "r") as file:
    lines = []
    for line in file:
        lines.append(line.strip("\n").split(","))

with open("out.csv", "w") as file:
    for line in unique_everseen(lines, key=frozenset):
        file.write(",".join(line)+"\n")

Output:

Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454

The OP said that both of these snippets seem not to work on large files (1.8 GB). I think it may be because both of them hold the whole file in memory, and a 1.8 GB file might take all the available RAM.

In order to solve that I made a few more attempts. Sadly, I must say that all of them are extremely slow compared to the first attempt. The first snippets sacrifice RAM for speed, but the following ones sacrifice speed, CPU and hard drive usage for lower RAM consumption (instead of holding the whole file in RAM, they take less than 50 MB).

Since all of these examples need heavier hard drive usage, it's advisable to keep the input and output files on different hard drives.

My first attempt using less RAM is with the shelve module:

import shelve, os

with shelve.open("tmp") as db:
    with open("in.csv", "r") as file:
        for line in file:
            l = line.strip("\n").split(",")
            l.sort()  # the sorted fields form an order-insensitive key
            db[",".join(l)] = l

    with open("out.csv", "w") as file:
        for v in db.values():
            file.write(",".join(v)+"\n")

# Clean up the shelve files; the shelve was opened as "tmp", not "temp"
# (the exact file names depend on the dbm backend).
os.remove("tmp.bak")
os.remove("tmp.dat")
os.remove("tmp.dir")

Sadly, this code takes hundreds of times longer than the first two snippets, which use RAM.

Another attempt is:

with open("in.csv", "r") as fileRead:
    # total = sum(1 for _ in fileRead)
    # fileRead.seek(0)
    # i = 0
    with open("out.csv", "w") as _:
        pass
    with open("out.csv", "r+") as fileWrite:
        for lineRead in fileRead:
            # i += 1
            line = lineRead.strip("\n").split(",")
            lineSet = set(line)
            write = True
            fileWrite.seek(0)
            for lineWrite in fileWrite:
                if lineSet == set(lineWrite.strip("\n").split(",")):
                    write = False
            if write:
                fileWrite.write(",".join(line)+"\n")
            # if i / total * 100 % 1 == 0: print(f"{i / total * 100}% ({i} / {total})")

This is slightly faster but not much.

If your computer has several cores, you could try to use multiprocessing:

from multiprocessing import Process, Queue, cpu_count
from os import remove

def slave(number, qIn, qOut):
    name = f"slave-{number}.csv"
    with open(name, "w") as file:
        pass
    with open(name, "r+") as file:
        while True:
            if not qIn.empty():
                get = qIn.get()
                if get is False:
                    qOut.put(name)
                    break
                else:
                    write = True
                    file.seek(0)                    
                    for line in file:
                        if set(line.strip("\n").split(",")) == get[1]:
                            write = False
                            break
                    if write:
                        file.write(get[0])

def master():
    qIn = Queue(1)
    qOut = Queue()
    slaves = cpu_count()
    slavesList = []

    for n in range(slaves):
        slavesList.append(Process(target=slave, daemon=True, args=(n, qIn, qOut)))
    for s in slavesList:
        s.start()

    with open("in.csv", "r") as file:
        for line in file:
            lineSet = set(line.strip("\n").split(","))
            qIn.put((line, lineSet))
        for _ in range(slaves):
            qIn.put(False)

    for s in slavesList:
        s.join()

    slavesList = []

    with open(qOut.get(), "r+") as fileMaster:
        for x in range(slaves-1):
            file = qOut.get()
            with open(file, "r") as fileSlave:
                for lineSlave in fileSlave:
                    lineSet = set(lineSlave.strip("\n").split(","))
                    write = True
                    fileMaster.seek(0)
                    for lineMaster in fileMaster:
                        if set(lineMaster.strip("\n").split(",")) == lineSet:
                            write = False
                            break
                    if write:
                        fileMaster.write(lineSlave)

            slavesList.append(Process(target=remove, daemon=True, args=(file,)))
            slavesList[-1].start()

    for s in slavesList:
        s.join()

As you can see, I have the disappointing task of telling you that both of my attempts are really slow. I hope you find a better approach; otherwise it will take hours if not days to execute on 1.8 GB of data (the actual time will primarily depend on the number of repeated values, which reduces it).

A new attempt: instead of storing everything in a file, this attempt keeps the active portion in memory and then writes it down to a file in order to process chunks faster. Afterwards, the chunks must be deduplicated again using one of the above methods:

lines = set()
maxLines = 1000  # Amount of lines kept in RAM at the same time. Higher numbers are faster but require more RAM.
perfect = True
with open("in.csv", "r") as fileRead:
    total = sum(1 for _ in fileRead)
    fileRead.seek(0)
    i = 0
    with open("tmp.csv", "w") as fileWrite:
        for line in fileRead:
            if len(lines) >= maxLines:
                # The chunk is full: flush it to disk before adding the current line.
                perfect = False
                for storedLine in lines:
                    fileWrite.write(",".join(storedLine) + "\n")
                lines = set()
                print(f"Storing {i / total * 100}% ({i} / {total})")
            lines.add(frozenset(line.strip("\n").split(",")))
            i += 1
            if i / total * 100 % 1 == 0: print(f"Reading {i / total * 100}% ({i} / {total})")
        if not perfect:
            # Write out the last, partially filled chunk as well.
            for storedLine in lines:
                fileWrite.write(",".join(storedLine) + "\n")

if not perfect:
    use_one_of_the_above_methods()  # Remember to read tmp.csv and not in.csv

This might boost the speed. You can change maxLines to any number you like; remember that the higher the number, the greater the speed (I'm not sure whether really big numbers do the opposite), but also the higher the RAM consumption.

Ender Look
  • Hi Ender Look, thanks, this works too. So in case anyone is a bigger Python fan: this answer works as well. I preferred the answer by James Brown only because it runs faster on my machine and the file is massive, so it takes time to run. But thanks nonetheless. :) – BeBo Apr 06 '19 at 19:35
  • Don't worry @BeBo, performance is something very important! Especially for 1.8 GB worth of data! By the way, I think you shouldn't remove the `python` tag from your question since it invalidates all the previous answers. – Ender Look Apr 06 '19 at 20:13
  • Thanks and I will put it back. – BeBo Apr 06 '19 at 20:59
  • So, after trying a lot of different things, I tried this with my actual data (after it worked with my small test data set). With my actual big data file it did not produce an output and I can't tell why because there is no error message at all. :-( – BeBo Apr 07 '19 at 13:33
  • Are you sure @BeBo? I haven't tested with 1.8 GB of data but 100 MB and it works fine. Are you saying that no file was written as output or the output was wrong? – Ender Look Apr 07 '19 at 13:40
  • No the output file that I got for a small dataset was fine. Only, when I tried it with a 1.8 GB data set it did not produce an output file. – BeBo Apr 07 '19 at 13:43
  • Do you use Python 32 or 64 bits? How much RAM do you have? The script stores all the data in RAM to work, so if you don't have enough RAM maybe the script stops working. – Ender Look Apr 07 '19 at 13:48
  • I will try to design a better version later, in order to not use so much RAM. – Ender Look Apr 07 '19 at 13:48
  • @BeBo sorry for the bad approach. The worst part is that the code can't be made faster with multithreading or multiprocessing because of the need to share the same file/variable. Maybe a *divide and conquer* strategy? – Ender Look Apr 08 '19 at 14:55
  • @BeBo, by the way, do you mind if it's in another language than python? – Ender Look Apr 08 '19 at 23:58
  • @BeBo, I've added a new attempt using multiprocessing, it will help if you have several cores. – Ender Look Apr 09 '19 at 01:36
  • Yeah I tried it but I am not quite sure how it is supposed to work. It does not give me any output. Am I doing something wrong? – BeBo Apr 13 '19 at 17:16
  • @BeBo, you know that you must change the name of the file in `open()`, right? By the way, are you constrained to only use python or are you free to use other languages? – Ender Look Apr 13 '19 at 18:42
  • It depends on the language, tbh. I know R and python. I can handle unix commands okay-ish. What did you have in mind? – BeBo Apr 14 '19 at 08:12
  • @BeBo, C#, or maybe C++. – Ender Look Apr 14 '19 at 12:58

From the problem description, a line should be kept only when the combination of its first and second fields, taken in either order, has not been seen before. If so, the awk below would help:

awk -F, '{seen[$1,$2]++;seen[$2,$1]++}seen[$1,$2]==1 && seen[$2,$1]==1' filename

Sample Input

Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
GHI,ABC,123
DEF,ABC,123
ABC,GHI,123
DEF,GHI,123

Sample Output

Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
GHI,ABC,123
DEF,GHI,123

sjsam
  • Thanks. This seems to work too with my test data set. For some weird reason, with my actual data set it produces an output file identical to the input: same size, same entries. Why does it not work with my big data file? Am I missing something? :-( – BeBo Apr 07 '19 at 09:58
  • @BeBo Not sure what the real cause is. Also, you may accept the more idiomatic [answer](https://stackoverflow.com/a/55551872/1620779) by @james-brown which first addressed your issue. This is just another version of that answer essentially doing the same thing. – sjsam Apr 07 '19 at 10:28
  • @BeBo For large files the kernel may kill the `awk` process due to resource exhaustion. See [this](https://unix.stackexchange.com/a/466066/25919). For large files `perl` or `python` may be better options. – sjsam Apr 07 '19 at 10:31
  • Okay, understood. The thing is: neither the python Script provided by @Ender Look nor the awk function actually _killed_ the process. The pythonic way did not produce a file at all and awk only reproduced the original input file with the duplicates. – BeBo Apr 07 '19 at 13:39
  • @BeBo Can you share the data or create similar? Let's get to the bottom of this. – James Brown Apr 07 '19 at 17:51
  • Okay, I will share something similar once I found out how it works. I can't share the data itself unfortunately. :-( – BeBo Apr 07 '19 at 19:36

If you want to use the csv library itself, you can use a DictReader and a DictWriter.

import csv

def main():
    """Read csv file, delete duplicates and write it."""
    with open('test.csv', 'r', newline='') as inputfile:
        with open('testout.csv', 'w', newline='') as outputfile:
            duplicatereader = csv.DictReader(inputfile, delimiter=',')
            uniquewrite = csv.DictWriter(outputfile, fieldnames=['address', 'floor', 'date', 'price'], delimiter=',')
            uniquewrite.writeheader()
            keysread = []
            for row in duplicatereader:
                key = (row['date'], row['price'])
                if key not in keysread:
                    print(row)
                    keysread.append(key)
                    uniquewrite.writerow(row)

if __name__ == '__main__':
    main()