0

As a beginner in using Python I'm stuck yet again! I've got two CSV files as follows:

CSV1 (Master List)

ID    Name    Code.    Prop
12    SAB     1234     ZXC
12    SAB     1236     ZXC
12    SAB     1233     ZXC
12    SAB     1234     ZXC
11    ASN     1234     ABV
16    HGF     1233     AAA
19    MAB     8765     BCT
19    MAB     8754     BCT

CSV2 (Subset)

ID    Name    Code.    Prop
12    SAB     1234     ZXC
12    SAB     1236     ZXC
12    SAB     1233     ZXC
12    SAB     1234     ZXC
19    MAB     8765     BCT
19    MAB     8754     BCT

My goal is to try and use the values in the first column of the CSVs to compare and identify those that do not occur in CSV2.

EDIT So in the above example rows with ID 11 and 16 from CSV1 (Master List) should get exported.

Something to consider. The ID although unique has multiple instances in both the CSV files (as demonstrated in the sample data from the csv files above).

I have gone through a few threads on this website such as this one. What I am trying to achieve is the exact opposite of what is asked here but I cannot understand the solution on that thread.

I have attempted to get some results but to no avail. I have attached the code that I am using below:

import csv

fOpen1=open('C:\Master.csv')
fOpen2=open('C:\Subset.csv')
fOutput1=open('C:\Untagged.csv', 'wb')


master=csv.reader(fOpen1)
subset=csv.reader(fOpen2)
untagged=csv.writer(fOutput1)


count=0

subsetCopy=list()

header1=master.next()
header2=subset.next()

untagged.writerow(header1)


for row2 in subset:
    subsetCopy.append(row2)


for row1 in master:
    for row2 in subsetCopy:
        if row1[0] != row2[0]:
            count=count+1
            untagged.writerow(row1)

print count

When I run this I get very absurd results in the order of millions (count). The weird thing is I used this exact same code without != (used == instead) to achieve another goal and it worked like a charm. I thought changing the equality condition should give me the opposite result. Instead it ends up producing a huge csv file with nothiing useful. I also tried to use a dictionary but then realised it may not work because of duplication of records in both the files. It is important for me to get all the instances of a particular row in both the files.

Where am I going wrong? Any advice/suggestions are welcome.

Community
  • 1
  • 1
VGu
  • 386
  • 5
  • 23
  • The pandas library can be very useful for this type of thing. If I understand your question correctly, the program should out out rows with ID 11 and 16? – Charlie Haley Aug 17 '15 at 06:00

3 Answers3

3

What you are doing wrong is in this loop -

for row1 in master:
    for row2 in subsetCopy:
        if row1[0] != row2[0]:
            count=count+1
            untagged.writerow(row1)

For each row1 in master it would compare the id against row2 (using id ) and then if they are not equal would write that row1 to untagged. This would cause row1 to be written to untagged many times , as many times as there are unrelated rows in subsetCopy , and this is not how you check whether row1's id exists in subset.

You would need first go through each row in subsetCopy and then save it in a set, and then compare what you are checking for against that set. Example -

import csv

fOpen1=open('C:\Master.csv')
fOpen2=open('C:\Subset.csv')
fOutput1=open('C:\Untagged.csv', 'wb')

master=csv.reader(fOpen1)
subset=csv.reader(fOpen2)
untagged=csv.writer(fOutput1)

count=0

subsetCopy=set()
header1=master.next()
header2=subset.next()
untagged.writerow(header1)

for row2 in subset:
    subsetCopy.add(row2[0])

for row1 in master:
    if row1[0] not in subsetCopy:
            count=count+1
            untagged.writerow(row1)
print count
Anand S Kumar
  • 88,551
  • 18
  • 188
  • 176
  • Thank you all for pointing out the flaw in the for loop. It didn't occur to me. @Anand- I just tried your approach and I am getting the wrong count. Just thinking out loud, the number of rows in the untagged csv and subset should be equal to master, shouldn't it? That's not what's happening in this case. Any ideas? – VGu Aug 17 '15 at 06:26
  • How much are your getting then? How far off are they? – Anand S Kumar Aug 17 '15 at 06:30
  • Master list = 8334 Subset list = 7497 count from your code = 968 Expected count = 837 – VGu Aug 17 '15 at 06:34
  • There may be some rows in subset list for which id is there in master list, but the row itself is not there in master list, did you check that possibility. I do not see any issue in the code as such . Or even some rows in subset list without any id in master list, or duplicated rows in subset list. – Anand S Kumar Aug 17 '15 at 06:37
  • Good question Anand! Unfortunately I am done for the day at my work! I won't be back to test until Thurs. Hopefully I can track you down then. Thank you so much. Would you know if I can drop you a personal message? Cheers. – VGu Aug 17 '15 at 06:51
  • Hi Anand, I had a look and looks like some erroneous rows have crept into the subset csv (possibly because of an error in one the python scripts used to cleanse raw data which produces the subset csv). One other observation with your code I didn't bring up the other day. Does **subsetCopy.add(row2[0])** actually work or should it be **.append**? Thank you once again. – VGu Aug 20 '15 at 00:51
  • Yea it adds the elements to the set, set do not have `.append()` , instead they have `.add()` . – Anand S Kumar Aug 20 '15 at 00:53
  • Thank you for your assistance Anand. It works okay now. – VGu Aug 20 '15 at 01:54
1

First of all you want to use a set which will make this a lot faster.

That being said, you're comparing each line of file 1 with each line of file 2. If they don't match (common case) - you're adding them to the untagged list.

It's important to notice that for any file with more than one line all lines have lines they do not match - which causes them to be written to the result that many times.

Set operations are included in set and you can utilize them for your own good, specifically difference which is described as:

new set with elements in s but not in t

set operations table

Here are the changes that need to be made:

Create a set from all key elements in subset and master using generator expressions:

subsetSet = set(s[0] for s in subset)
masterSet = set(m[0] for m in master)

Find difference of key elements:

untagged_keys = subset - master
print untagged_keys

Now use these keys when you read from master to only read untagged lines:

with open('C:\Master.csv') as f:
    master = csv.reader(f)
    untagged = [m for m in master if m and m[0] in untagged_keys]

print untagged
Reut Sharabani
  • 30,449
  • 6
  • 70
  • 88
0

For the first part, even if you find a line from master in subset, you continue to compare it to other lines and add it in untagged.

Your for loop should be:

for row1 in master:
    for row2 in subsetCopy:
        if row1[0] == row2[0]: 
            break
    else:
        count=count+1
        untagged.writerow(row1)

That is: if the line is anywhere in subset ignore it, and only count it if you went through subset2 without finding it.

Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252