Python: How to compare two csv files and add classifier in first file if value exists in second file

Question

I try to compare two csv files. First file (movements.csv) has 14 columns, second csv (LCC.csv) one single column. I want to check whether the entries (strings) of column 8 in movements.csv appear somewhere in column 1 of LCC.csv. If so, in column 14 a 'Yes' should be written, if not a 'No'. The code I tried so far is and the error message I receive:

import csv

f1 = file('LCC.csv', 'rb') 
f2 = file('movements.csv', 'rb')
f3 = ('output.csv', 'wb') 

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

movements = list(c2)

for LCC_row in c1:
    row = 0
    found = False
    for movements_row in movements:
        output_row = movements_row
        if movements_row[7] == LCC_row[0]
            output_row.append('Yes')
            found = True
            break
        row += 1
    if not found:
        output_row.append('No')
    c3.writerow(output_row)

f1.close()
f2.close()
f3.close()

enter image description here

I'm a complete beginner with python, so any advice is appreciated! Optimally the check between the two columns would also disregard whether the strings are written in capital letters or not.

The error message comes after

c3.writerow(output_row)

as

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
>>>

LCC.csv (no header):

Air Ab  
Jamb  
Sw  
AIRF  
EURO

movements.csv (has a header):

ap,ic,year,y_m,pas,da,ty,airl,ic_a,dest_orig,ic_d,coun,cont,LCC  
Zue,LSZH,2005,200501,25,1/1/2005,Dep,"EURO",EUJ,"Mans C",EG,Gb,Eu,   
Zue,LSZH,2005,200501,204,1/1/2005,Arr,"Sw",SWR,"Dar",HA,Tans,A,   
Ba,LSZM,2005,200501,191,1/1/2005,Arr,"AIRF",AFR,"PG",LG,Fr,Eu,   
Zue,LSZH,2005,200501,228,1/1/2005,Dep,"THA",THA,Bang,VD,Th,As,

as already said, the last column (LCC) is completely empty at the moment

I receive an error message after if movements_row[7] == LCC_row[0], namely: File "", line 6 if movements_row[7] == LCC_row[0] ^ SyntaxError: invalid syntax — Anna Stünzi, Dec 19 '16 at 13:54
Please edit your question with the error message. And clearly mark which line causes it. — Code-Apprentice, Dec 19 '16 at 13:55
@AnnaStünzi: would it be ok to use pandas to solve this problem ?? — Vikash Singh, Dec 19 '16 at 14:03
I have fixed the code bugs and changed it to make it work. Please check my answer. — Vikash Singh, Dec 19 '16 at 14:15
Hi @VikashSingh, thanks a lot for your help. I have not used pandas yet but if it works, yes this would be ok :). I edited the question using your code and I added the error message I still receive (@Code-Apprentice: is this how you ask me to do it)? — Anna Stünzi, Dec 19 '16 at 14:57

Moinuddin Quadri · Answer 1 · 2016-12-19T14:01:44.770

1

It has many issues. Few which I found after glancing at the code are:

You having invalid quote ' in your line:

f2 = file('movements.csv', ,rb')
#                          ^

It should be:

f2 = file('movements.csv', 'rb')

In the code you shared you are having ` back quote at various places instead of single quote '. For example, your lines should be:
```
f1 = file('LCC.csv', 'rb') 
f3 = file('output.csv', 'wb')    
#     ^ also missing file here
```

Missing colon : after if. It should be:

if movements_row[7] == LCC_row[0]:
#                           Here ^

Also, for initializing the string, you do not need parenthesis. Just assign it like:

output_row[13] = 'Yes'
#                ^ As simple string

edited Dec 19 '16 at 14:01

answered Dec 19 '16 at 13:54

Moinuddin Quadri

46,825
13
96
126

` is called a back quote or back tick. – Code-Apprentice Dec 19 '16 at 13:58
Yes sorry, this is a copy paste error here in the forum, in the code I use I have the ' everywhere, I just checked again – Anna Stünzi Dec 19 '16 at 13:59
There is still some mistakes – iFlo Dec 19 '16 at 14:01
1

@iFlo: Yes, thats what even I find out. Every time I see code, I find few. And I am not checking for logical errors – Moinuddin Quadri Dec 19 '16 at 14:03
Thanks to all for your help. Adding the missing colon helped for the syntax error. However now I receive another message: Traceback (most recent call last): File "", line 1, in _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode? – Anna Stünzi Dec 19 '16 at 14:06
1

@MoinuddinQuadri SyntaxError almost always means missing or incorrect punctuation. I commonly get this when I forget colons and closing parentheses. To track down the problem, start at the line indicated in the error message and work backwards. – Code-Apprentice Dec 19 '16 at 14:09
@AnnaStünzi: You should check Vikas' answer. He wrote a working prototype – Moinuddin Quadri Dec 19 '16 at 14:12
@AnnaStünzi Please edit your question with each error message. – Code-Apprentice Dec 19 '16 at 14:14

score 0 · Answer 2 · answered Dec 19 '16 at 14:05

0

You're trying to do too much at the same time. Split this into different tasks. First we'll read the contents of LCC.csv into a set (we could use a list, but sets are better for determining membership). Then we will go through movements.csv to rewrite it.

import csv

with open('LCC.csv', 'rb') as lcc:
    lcc_set = set()
    lcc_r = csv.reader(lcc)
    for l in lcc_r:
        for i in l:
            lcc_set.add(i)

with open('movements.csv', 'rb') as movements:
    mov_r = csv.reader(movements)
    with open('output.csv', 'wb') as output:
        out_w = csv.writer(output)
        for l in mov_r:
            #l.pop()
            if l[7] in lcc_set:
                l.append('Yes')
            else:
                l.append('No')
            out_w.writerow(l)

I'm not clear if you wanted to add a column or replace the last one. I've commented out the line that will cause the last column to be replaced by Yes or No

answered Dec 19 '16 at 14:05

Patrick Haugh

59,226
13
88
96

Hi Patrick, thanks for your help too. Using your code I get following error message: Traceback (most recent call last): File "", line 11, in AttributeError: '_csv.writer' object has no attribute 'writerowl' >>> do you see a reason why? Thank you so much! – Anna Stünzi Dec 19 '16 at 14:36
@AnnaStünzi It looks like you're missing the parenthesis `writerowl` -> `writerow(l)` – Patrick Haugh Dec 19 '16 at 14:41
sorry, yes - the code is not giving any error anymore, but there is no classification in the output.csv, all rows in column 14 have now entry 'No'.. – Anna Stünzi Dec 19 '16 at 15:13
It looks like that's because it's `Sw` in one file and `"Sw"` in the other. do `if l[7].strip('"') in lcc_set` instead – Patrick Haugh Dec 19 '16 at 15:21

score 0 · Accepted Answer · edited May 23 '17 at 11:45

0

There are quite a few bugs in your code. They have been pointed out here: https://stackoverflow.com/a/41224147/3027854

One problem with moments.csv

ap,ic,year,y_m,pas,da,ty,airl,ic_a,dest_orig,ic_d,coun,cont,LCC 
Zue,LSZH,2005,200501,25,1/1/2005,Dep,"EURO",EUJ,"Mans C",EG,Gb,Eu, 
Zue,LSZH,2005,200501,204,1/1/2005,Arr,"Sw",SWR,"Dar",HA,Tans,A, 
Ba,LSZM,2005,200501,191,1/1/2005,Arr,"AIRF",AFR,"PG",LG,Fr,Eu, 
Zue,LSZH,2005,200501,228,1/1/2005,Dep,"THA",THA,Bang,VD,Th,As,

except the header line you have one extra column in each line. As they end with ", ". I have added handling for that in my code

import csv

f1 = open('LCC.csv', 'rU') 
f2 = open('movements.csv', 'rU')
f3 = open('output.csv', 'w') 

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

# first we will read all LCC values into a set.
LCC_row_values = set()
for LCC_row in c1:
    LCC_row_values.add(LCC_row[0].strip())

row = 0
for movements_row in c2:
    row += 1
    if row == 1:
        # movements_row.append('is_present')
        # c3.writerow(movements_row)
        # skip header of moments.csv file
        continue
    # Remove last extra column from output row
    output_row = movements_row[:-1]
    if movements_row[7] in LCC_row_values:
        output_row.append('Yes')
    else:
        output_row.append('No')
    c3.writerow(output_row)

f1.close()
f2.close()
f3.close()

Here example files are

LCC.csv

Air Ab 
Jamb 
Sw 
AIRF 
EURO

movements.csv

ap,ic,year,y_m,pas,da,ty,airl,ic_a,dest_orig,ic_d,coun,cont,LCC 
Zue,LSZH,2005,200501,25,1/1/2005,Dep,"EURO",EUJ,"Mans C",EG,Gb,Eu, 
Zue,LSZH,2005,200501,204,1/1/2005,Arr,"Sw",SWR,"Dar",HA,Tans,A, 
Ba,LSZM,2005,200501,191,1/1/2005,Arr,"AIRF",AFR,"PG",LG,Fr,Eu, 
Zue,LSZH,2005,200501,228,1/1/2005,Dep,"THA",THA,Bang,VD,Th,As,

output.csv

Zue,LSZH,2005,200501,25,1/1/2005,Dep,EURO,EUJ,Mans C,EG,Gb,Eu,Yes
Zue,LSZH,2005,200501,204,1/1/2005,Arr,Sw,SWR,Dar,HA,Tans,A,Yes
Ba,LSZM,2005,200501,191,1/1/2005,Arr,AIRF,AFR,PG,LG,Fr,Eu,Yes
Zue,LSZH,2005,200501,228,1/1/2005,Dep,THA,THA,Bang,VD,Th,As,No

edited May 23 '17 at 11:45

Community

1
1

answered Dec 19 '16 at 14:07

Vikash Singh

13,213
8
40
70

Hi, thanks a lot! I adapted your code so that column 8 is compared to column 1. However, I still receive an error message: Traceback (most recent call last): File "", line 1, in _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode? >>> What should I do? Also, I do not want to add a new column but fill the last one (column 14) which is currently empty. Thanks! – Anna Stünzi Dec 19 '16 at 14:24
Please share top 5 lines of both CSV files. Put it in the original question. It will help me solve your problem. – Vikash Singh Dec 19 '16 at 14:53
please check now. Do you need header in output file? – Vikash Singh Dec 19 '16 at 15:32
I still receive quite the same error message: Traceback (most recent call last): File "", line 7, in _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode? I don't need a header in the output file. – Anna Stünzi Dec 19 '16 at 15:47
@AnnaStünzi line 7 has some issue in the csv file. Can you add that line in the sample you have shared .. – Vikash Singh Dec 19 '16 at 15:49
@AnnaStünzi sorry, seems like some other issue. try reading input file like this `f1 = open('LCC.csv', 'rU')` `f2 = open('movements.csv', 'rU')` . source http://stackoverflow.com/questions/6726953/open-the-file-in-universal-newline-mode-using-the-csv-django-module – Vikash Singh Dec 19 '16 at 15:51
now I don't get this error message anymore but the column is filled with 'No's only. Do I need to open output file as well like this: f3 = open('output.csv', 'wU')? – Anna Stünzi Dec 19 '16 at 16:13
no. output file should be open as `f3 = open('output.csv', 'w')` – Vikash Singh Dec 19 '16 at 16:17
@AnnaStünzi sorry the column no in `movements_row[8]` should be 7. – Vikash Singh Dec 19 '16 at 16:20
@AnnaStünzi Sorry. I just realised the mistake I was making. Corrected it now. Should work. Please try and let me know. – Vikash Singh Dec 19 '16 at 17:50
@AnnaStünzi Waiting for your feedback on the answer. – Vikash Singh Dec 20 '16 at 06:03

Python: How to compare two csv files and add classifier in first file if value exists in second file

3 Answers3