0

I am new to Python, so please forgive me. I have pieced this together through things i've found online, however, it's still not working exactly as it should.

I'm wanting a python script that will look in a given spreadsheet (list.csv), parse it for any "key_words", then export a file of only the rows that DO NOT contain any "key_words" called "cleaned.csv". I would like for it to only look in the first column, [0]. If possible, I would like for it to also export me a second spreadsheet of the ones that DO contain keywords, just to verify what it is scraping out.

This current code looks at the entire csv file and I see it not putting some of the rows in "cleaned.csv", when technically, it should be, unless there is a problem with my array.

Here is my current code...

key_words = [ 'Dog', 'Cat', 'Bird', 'Cow', ]

with open('list.csv') as oldfile, open('cleaned.csv', 'w') as newfile:
    for line in oldfile:
        if not any(key_word in line for key_word in key_words):
            newfile.write(line)

First couple rows of data are...

Dog,Walks,Land,4legs,
Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,

Cleaned.csv should show:

Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,

Other.csv (bad, matching array) should show:

Dog,Walks,Land,4legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,
codyfraley
  • 99
  • 3
  • 12
  • Possible duplicate of [Read Specific Columns from csv file with Python csv](http://stackoverflow.com/questions/16503560/read-specific-columns-from-csv-file-with-python-csv) – tripleee Jun 03 '16 at 04:18
  • @tripleee OP does not know how to use if conditionals – Merlin Jun 03 '16 at 04:36

2 Answers2

1

Well code looks fine and worked for me, so there is no problem with it per se.

If you only want to check in the first row you have to split the line by ",":

key_words = ['Dog', 'Cat', 'Bird', 'Cow', ]

with open('list.csv') as oldfile, open('cleaned.csv', 'w') as cleaned, open("matched.csv", "w") as matched:
    for line in oldfile:
        if not any(key_word in line.split(",", 1)[0] for key_word in key_words):
            cleaned.write(line)
        else:
            matched.write(line)

If first column is always a "word" and not a "sentence" (like Dog is out) then you could improve the test like this:

if not line.split(",", 1)[0] in key_words:

NOTE: with strings test be careful with case sensitivity.

Note that providing a maxsplit=1 here line.split(",", 1) will improve string parsing performance especially if you have longer lines because it will stop parsing after finding first , and return a list of 2 items. First item will be your first column. Read more here:

https://docs.python.org/2/library/stdtypes.html#str.split

TEST RESULT:

mac: cat list.csv
Dog,Walks,Land,4legs,
Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,

mac: cat cleaned.csv
Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,

mac: cat matched.csv
Dog,Walks,Land,4legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,
Dmitry Tokarev
  • 1,851
  • 15
  • 29
  • I think I'm understanding what you're saying. Can I message you for better clarification? – codyfraley Jun 03 '16 at 05:17
  • @CFraley sure. Glad to help. – Dmitry Tokarev Jun 03 '16 at 05:18
  • @CFraley i think you need to have reputation of 20 to chat: http://stackoverflow.com/help/privileges/chat. But you can post your questions here in comments about the answer. Also consider using https://docs.python.org/2/library/csv.html for all your csv file processing needs in future. – Dmitry Tokarev Jun 03 '16 at 05:29
  • This definitely works. I was up way too late last night and I was spinning my wheels for nothing. Long story short, I had the .py script saved in 2 different locations. The one I was modifying with your changes was saved in one place, the alias in my terminal was pointing to another. Thanks so much for your help! I appreciate you explaining everything in detail. – codyfraley Jun 03 '16 at 14:50
0

Here is a pure pandas method:

In [51]:
key_words = [ 'Dog', 'Cat', 'Bird', 'Cow']
t="""Dog,Walks,Land,4legs
Fish,Swims,Water,fins
Kangaroo,Hops,Land,2legs
Cow,Walks,Land,4legs
Bird,Flies,Air,2legs"""
df = pd.read_csv(io.StringIO(t), header=None)
df

Out[51]:
          0      1      2      3
0       Dog  Walks   Land  4legs
1      Fish  Swims  Water   fins
2  Kangaroo   Hops   Land  2legs
3       Cow  Walks   Land  4legs
4      Bird  Flies    Air  2legs

we can create a regex pattern and pass this to str.contains and negate the boolean condition to mask the df prior to calling to_csv:

In [55]:    
pat = '|'.join(key_words)
df[df.apply(lambda x: ~x.str.contains(pat).any(), axis=1)]

Out[55]:
          0      1      2      3
1      Fish  Swims  Water   fins
2  Kangaroo   Hops   Land  2legs

So we use apply with param axis=1 to apply our lambda row-wise, we test the negated str.contains with any to see if any column does not contain our keywords:

In [56]:
df.apply(lambda x: ~x.str.contains(pat).any(), axis=1)

Out[56]:
0    False
1     True
2     True
3    False
4    False
dtype: bool
EdChum
  • 376,765
  • 198
  • 813
  • 562