Python: Removing duplicate tuples in 2 csv files

Question

I have 2 csv files containing sorted tuples of integers

old file.csv

"(1, 2, 3)","(1, 2, 4)","(1, 3, 5)"

new file.csv

"(1, 2, 3)","(1, 2, 4)"

I want to remove common tuples between these two csv files and print the output as final.csv

Expected Output

"(1,3,5)"

Code Attempt A

import csv

with open('old file.csv', newline ='') as myFile_1:  
    reader = csv.reader(myFile_1)
    list_a = list(reader)
    older = [tuple(map(int, i)) for i in list_a]

with open('new file.csv', newline ='') as myFile_2:  
    reader = csv.reader(myFile_2)
    list_b = list(reader)
    newer = [tuple(map(int, i)) for i in list_b]

final_output = older.difference(newer)

csvData = [final_output]

with open('final.csv', 'w') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerows(csvData)

csvFile.close()

Error Type

Exception has occurred: ValueError
invalid literal for int() with base 10: '(1, 2, 3)'

Code Attempt B

import csv

with open('old file.csv', newline ='') as myFile_1:  
    reader = csv.reader(myFile_1)
    list_a = list(reader)
    older = [tuple(map(str, i)) for i in list_a]

with open('new file.csv', newline ='') as myFile_2:  
    reader = csv.reader(myFile_2)
    list_b = list(reader)
    newer = [tuple(map(str, i)) for i in list_b]

final_output = older.difference(newer)

csvData = [final_output]

with open('final.csv', 'w') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerows(csvData)

csvFile.close()

Error Type

Exception has occurred: AttributeError
'list' object has no attribute 'difference'

This issue arose when I wanted to manipulate csv files and worked pretty well when the data contained in old.csv and new.csv were generated while running the program and were stored as a variable. This works fine when generating smaller data sets but is extremely problematic when generating large data sets.

Does order matter? This is two-fold. Is (1, 2, 4) different from (2, 1, 4)? Does it matter what order (1, 2, 4) and (3, 5, 7) appear in the final list? — fendall, Sep 05 '19 at 19:52
Yes order matters. (1,2,4) is a separate tuple and (2,1,4) is a separate one. The original tuple file is an ordered tuple in usually ascending (1,2,4) or descending (4,2,1) order and I would like to maintain the same. In no scenario would I be making a comparison between two files having values in opposite orders — user7970547, Sep 05 '19 at 19:54
It does not matter what order (1,2,4) and (3,5,7) appear in the final list. However, I can sort the appearance of tuples based on their indexes by writing a function. — user7970547, Sep 05 '19 at 20:00
Perhaps this helps you: https://stackoverflow.com/questions/3462143/get-difference-between-two-lists. Use sets instead of lists — b3rt0, Sep 05 '19 at 20:11

score 3 · Accepted Answer · answered Sep 05 '19 at 20:10

I would actually recommend to change data storing strategy and not save raw data structures representation into a csv files.
But if you're not allowed to effect those things - use the following short approach:

import csv
from ast import literal_eval

with open('old_file.csv', newline ='') as f1, open('new_file.csv', newline ='') as f2:
    t1 = literal_eval('{{{}}}'.format(f1.read().replace('"', '')))
    t2 = literal_eval('{{{}}}'.format(f2.read().replace('"', '')))

    final_output = t1 - t2

with open('final.csv', 'w') as csv_result:
    writer = csv.writer(csv_result, delimiter=',', quotechar='"')
    writer.writerow(final_output)

literal_eval('{...}' - allows to get a set of tuples at once wrapping the passed argument with set object literal {}

The final final.csv file contents:

"(1, 3, 5)"

How would you reckon with the OP's concern of large data sets? Are Python sets really more memory efficient than whatever approach was used before the data was stored in csv files? — fendall, Sep 05 '19 at 20:20

Pynchia · Answer 2 · 2019-09-06T08:24:50.540

Assuming you can alter the format of the input files, as suggested above, for example with the shell command

cat file.csv | sed 's/",/"\n/g' > file.txt

and go for one tuple per line, a solution that allows you to process larger files would be:

import itertools as it

diff = set()
with open('old_file.txt') as f1, open('new_file.txt') as f2:
    ts = it.chain(f1, f2)
    for t in ts:
        t2 = t.rstrip()  # remove the newline
        try:
            diff.remove(t2)
        except KeyError:
            diff.add(t2)

print(diff)

produces

{'"(1, 3, 5)"'}

Note: diff is a set that contains the tuples as strings, i.e. in their original format, as that is the output you have indicated. A set is used because it is fast to check if an element is present.

Basically go through each value one by one, one file after the other and remove each element. If the value isn't there yet then add it to the difference.

Only the wanted difference grows in memory, each file is simply iterated upon without creating a list.

However, if the tuples can appear more than once in each file, we must keep track of where they come from. After scanning the values we filter the results and keep the ones that appear in one file only.

For example with inputs like

(m37) ➜  SO cat old_file.txt      
"(1, 2, 3)"
"(1, 2, 4)"
"(1, 2, 4)"
"(1, 3, 5)"
(m37) ➜  SO cat new_file.txt      
"(1, 2, 3)"
"(3, 7, 9)"
"(1, 2, 4)"
"(3, 7, 9)"
"(1, 2, 3)"

this code

from collections import defaultdict
import itertools as it

diff = defaultdict(set)
with open('old_file.txt') as f1, open('new_file.txt') as f2:
    ts = it.chain(zip(f1, it.repeat(1)), zip(f2, it.repeat(2)))
    for t, fi in ts:
        t2 = t.rstrip()
        diff[t2].add(fi)

final_diff = (t for t, fs in diff.items() if len(fs) == 1)
print(list(final_diff))

produces

['"(1, 3, 5)"', '"(3, 7, 9)"']

Note: the code can be easily altered to cater for any number of files, not just two.

@user7970547 This solution most closely addresses the original post, including handling for large data sets — fendall, Sep 05 '19 at 20:51
@Pynchia Changing file format would be difficult, however I can generate the files as .csv or .xlsx or .txt. Will I have to generate each tuple as a newline (/n) for the code to function properly? — user7970547, Sep 06 '19 at 17:18
If you cannot separate the tuples with a newline, keep the files as you have described and transform them with the sed command I have shown. Alternatively, consider [this QA](https://stackoverflow.com/questions/16260061/reading-a-file-with-a-specified-delimiter-for-newline) — Pynchia, Sep 06 '19 at 18:47
@Pynchia Is there a way to convert this using command prompt in windows? Conversion is turning out to be pretty tricky moving from one system to another. — user7970547, Sep 14 '19 at 19:40
The solution is merging the files while I am looking for 'set difference' . Remove all elements of new file contained in the old file. — user7970547, Feb 18 '20 at 19:03

score 0 · Answer 3 · answered Sep 05 '19 at 20:33

Here's what I worked out:

tuple_1_list = open('test_tuple_1.csv', 'r').read().splitlines()
tuple_2_list = open('test_tuple_2.csv', 'r').read().splitlines()

tuple_1_list = list(map(lambda x: tuple(x.replace('\'', '').replace('\"', '')[1:-1].split(', ')), tuple_1_list))
tuple_2_list = list(map(lambda x: tuple(x.replace('\'', '').replace('\"', '')[1:-1].split(', ')), tuple_2_list))

output_set = set()

for i in tuple_1_list:
    output_set.add(i)

for i in tuple_2_list:
    output_set.add(i)

for i in tuple_1_list:
    if i in tuple_2_list and i in output_set:
        output_set.remove(i)

for i in tuple_2_list:
    if i in tuple_1_list and i in output_set:
        output_set.remove(i)

with open('test_tuple_out.csv', 'w+') as out_file:
    for output in output_set:
        out_file.write('\"' + str(output).replace('\'', '').replace(', ', ',') + '\"')

I was able to get your expected output with your input CSVs. Hope this works for you.

Python: Removing duplicate tuples in 2 csv files

3 Answers3