1

I am a beginning python user.

How do you write a row to a csv file by comparing a separate list of text patterns, and excluding rows with the matching pattern?

Here is a specific example:

listfile: spam, eggs, bacon,

csvfile:  
          col 1   col 2     col 3
row 1     zzz    not eggs   zzz
    2     xxx    bacon      qqq
    3     eee    not bacon  ttt
    4     ttt    eggs       hhh
    5     ggg    not spam   ppp
    6     yyy    eggs       www

The csv file I need to write is supposed to have only rows 1, 3 and 5, because col 2 value did not match any of the values of the list for those rows.

Assuming the below filedata, how would I write this?

mycsv = csv.reader(open('spameggsbacon.csv'))
listfile = listfile.txt
for row in mycsv:
   text = row[1]
   writecsvfile = open('write.csv', 'a') 

EDIT: based on Md Johirul Islam's answer, I tried:

import csv
import pandas as pd
data = pd.read_csv('spameggsbacon.csv')
listfiledata = 'listfile.txt'
with open(listfiledata) as f:
    listfiledata = f.readlines()
listfiledata = [x.strip() for x in listfiledata] 
data = data[~data['col2'].isin(listfiledata)]
data.to_csv('spameggsbacon.csv', sep=',')
print(listfiledata)
print(data.head)

The code runs, but does not remove the rows that have matching values. It appears the reason has to do with how this line is written:

 data = data[~data['col2'].isin(listfiledata)]

Edit 2: Not sure if it matters, but I revised the original example to clarify that the values in col2 may repeat, for example, 'eggs' appears in both row 4 and row 6

Edit 3:

Here is what you see if you run

 print(listfiledata)
 print(data.head)

Output is:

['spam,eggs,bacon']
<bound method NDFrame.head of   col1       col2 col3
0    zzz   not eggs  zzz
1    zzz      bacon  zzz
2    zzz  not bacon  zzz
3    zzz       eggs  zzz
4    zzz   not spam  zzz
5    zzz       eggs  zzz>
user3447273
  • 351
  • 1
  • 4
  • 12
  • How are the columns separated in your csv file? Since you have words like "not eggs" in it, your delimiter can't be simply whitespace. And how do you provide the file list data? Is it a list of strings? – Mr. T Feb 11 '18 at 19:29
  • I use commas to separate the values. Does it matter if you are using csv.reader? Split method would not work, but csv.reader is supposed to work, even with spaces, right? – user3447273 Feb 11 '18 at 20:20
  • Of course, [csv.reader](https://docs.python.org/3.6/library/csv.html#csv.Sniffer) accepts different delimiters and you can use [csv.sniffer](https://stackoverflow.com/questions/16312104/can-i-import-a-csv-file-and-automatically-infer-the-delimiter#16312203) to guess the dialect. But it is better to be explicit than implicit, to prevent unpleasant surprises. Some programs use two spaces, when columns contain also whitespace separated entries. CSV reader can't guess this behaviour. – Mr. T Feb 11 '18 at 20:34
  • See your output `['spam,eggs,bacon']`? Is it a list of string? Try to debug your code and split `'spam,eggs,bacon'` to form a list of restricted words – Md Johirul Islam Feb 12 '18 at 00:03
  • Form list of restricted words `['spam','eggs','bacon']` – Md Johirul Islam Feb 12 '18 at 00:09
  • It is a comma separated list: spam,eggs,bacon. I don't know how to form a list of restricted words. I tried `listfiledata.split()` but the result was the same. – user3447273 Feb 12 '18 at 00:12

2 Answers2

1

Try to use the pandas library. For example lets consider your intitial csv file name is spameggsbacon.csv with the data

          col 1   col 2     col 3
row 1     zzz    not eggs   zzz
    2     xxx    bacon      qqq
    3     eee    not bacon  ttt
    4     ttt    eggs       hhh
    5     ggg    not spam   ppp
    6     yyy    spam       www

You read this data as a DataFrame

import pandas as pd
data = pd.read_csv('spameggsbacon.csv')

Then you can apply filtering on this DataFrame

listfiledata = ['spam', 'eggs', 'bacon']
data = data[~data['col 2'].isin(listfiledata)]

ow the filtered DataFrame will not contain the rows that have any entry from the listfiledata. Now you can rewrite the csv file using following code with your filtered data.

data.to_csv('spameggsbacon.csv', sep=',')

You can also write to a new file. Just change the file name.

Md Johirul Islam
  • 5,042
  • 4
  • 23
  • 56
  • Thanks. How do you bring in the data from the text file (there are 40 values in the actual file) to get: `listfiledata = ['spam', 'eggs', 'bacon']` in your example? – user3447273 Feb 11 '18 at 19:29
  • You have to read the data into a list https://stackoverflow.com/questions/3277503/how-do-i-read-a-file-line-by-line-into-a-list – Md Johirul Islam Feb 11 '18 at 19:30
  • Regarding `data = data[~data['col 2'].isin(listfiledata)]`, what if 'col 2' does not have a header title? Using `data = data[~data[1].isin(listfiledata)]` does not work. – user3447273 Feb 11 '18 at 20:00
  • when you read data using pandas. it will assign some header. try to print `data.head()` and you will see what headers are applied – Md Johirul Islam Feb 11 '18 at 20:19
  • `data.head()` does not print any made up headers. It uses the first row contents as the header. I will revise code to keep the headers when that file is created. – user3447273 Feb 11 '18 at 20:54
  • Ok I revised the code to include the header and the code runs, however, the line `data = data[~data['col 2'].isin(listfiledata)]` did not exclude the rows. The file remained unchanged. I tested by creating a new file. They are the same. Will revise my question. – user3447273 Feb 11 '18 at 21:19
  • I tested with only the values and data in my example, and got the same result. It did not exclude the data as intended. – user3447273 Feb 11 '18 at 22:14
  • Can you try to print `listdata` and print `data.head()`. I want to see what are being printed – Md Johirul Islam Feb 11 '18 at 23:25
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164939/discussion-between-user3447273-and-md-johirul-islam). – user3447273 Feb 11 '18 at 23:54
0

Here is the answer that worked:

import csv
import pandas as pd
import os

df1 = pd.read_csv('spameggsbacon.csv') # file with multiple columns
df2 = pd.read_csv('listfile.csv') # file with one column

mask = df1.iloc[:,1].isin(df2.iloc[:,0])
df1[~mask].to_csv('spameggsbacon2.csv', index=False)

print(df1[~mask])

This worked with the example shown, but it does not work (re: file with one column) with larger lists, such as 40 values instead of 3.

user3447273
  • 351
  • 1
  • 4
  • 12