
I am trying to use Python to remove all the lines in file1.csv that contain the strings from file2.csv. I want it to search all the values in column1 of file1.csv and remove the entire row whenever the column1 value contains any of the strings from file2.csv as a substring.

I know `grep -v` in bash can do the same thing with just one command (something like `grep -v -f file2.csv file1.csv`). However, I need to suppress file1.csv against over 40,000 possible strings in file2.csv, and bash takes forever and even crashes when executing this command.

Does anyone know a solid Python script that does what `grep -v` does, but scales to suppressing against a file with tens of thousands of strings?

Just to make sure it's clear:

File1.csv:

column1,column2,column3
www.gamai4xheifw.com,4410,22
www.vfekjfwo11k.com,772,100
www.gosi4xnbdn.com,1793,39
www.tum33kkwfl.com,1100,2
www.eei4xelwf.com,9982,14

File2.csv:

column1
i4x

File3.csv:

column1,column2,column3
www.vfekjfwo11k.com,772,100
www.tum33kkwfl.com,1100,2

But, again, I need it in Python because file2.csv contains over 40,000 strings.

1 Answer


One solution which may work for your use case is the third-party library Pandas combined with a regex filter.

However, with this many filter strings I strongly recommend you utilise a more efficient algorithm, for example one that implements the trie-based Aho-Corasick string-matching algorithm, such as this solution; a sketch is included after the Pandas example below.

import re
import pandas as pd
from io import StringIO

mystr1 = StringIO("""column1,column2,column3
www.gamai4xheifw.com,4410,22
www.vfekjfwo11k.com,772,100
www.gosi4xnbdn.com,1793,39
www.tum33kkwfl.com,1100,2
www.eei4xelwf.com,9982,14""")

mystr2 = StringIO("""column1
i4x""")

# read files, replace mystr1 / mystr2 with 'File1.csv' / 'File2.csv'
df = pd.read_csv(mystr1)
df_filter = pd.read_csv(mystr2)

# create regex string from filter values; re.escape ensures any
# regex metacharacters in the filter strings are matched literally
str_filter = '|'.join(map(re.escape, df_filter['column1']))

# apply filtering
df = df[~df['column1'].str.contains(str_filter)]

# export back to csv
df.to_csv('file_out.csv', index=False)

print(df)

               column1  column2  column3
1  www.vfekjfwo11k.com      772      100
3   www.tum33kkwfl.com     1100        2
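
For reference, here is a minimal sketch of the Aho-Corasick approach recommended above. It assumes the third-party pyahocorasick package (pip install pyahocorasick) and uses the file names from the question; treat it as a starting point rather than a drop-in solution.

import csv
import ahocorasick

# build the automaton once from all the filter strings in File2.csv
automaton = ahocorasick.Automaton()
with open('File2.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)                          # skip the 'column1' header
    for row in reader:
        automaton.add_word(row[0], row[0])
automaton.make_automaton()

# stream File1.csv and keep only rows whose column1 value contains
# none of the filter strings
with open('File1.csv', newline='') as fin, \
        open('File3.csv', 'w', newline='') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    writer.writerow(next(reader))         # copy the header row
    for row in reader:
        # automaton.iter() yields every filter string found as a
        # substring of row[0]; an empty iterator means keep the row
        if next(automaton.iter(row[0]), None) is None:
            writer.writerow(row)

Building the trie costs time proportional to the total length of the filter strings, and each row is then scanned once, so 40,000 patterns do not multiply the per-row cost the way a large regex alternation can.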
– jpp
  • Hi @jpp, thanks for the help! I was wondering how I can define mystr1 and mystr2 when both file1.csv and file2.csv have tens of thousands of lines? Is there a shorter way to put them in the script? File1.csv has 500k rows and file2.csv has 45k rows; neither will fit in the script when I execute it in Python. Or am I missing something? – Billy the Poet Jun 09 '18 at 23:08
  • Replace them with your filenames, i.e. `'File1.csv'` and `'File2.csv'`. I'm including your data in my question to show a reproducible result. – jpp Jun 09 '18 at 23:10
  • Ok, but I keep getting this error, "TypeError: 'Series' objects are mutable, thus they cannot be hashed", after running df = df[~df['column1'].str.contains(str_filter)] – Billy the Poet Jun 09 '18 at 23:23
  • @BillythePoet, seems like your csv files aren't exactly in the format shown in your question. As you can see from my script, it works with the data as provided. – jpp Jun 09 '18 at 23:24
  • So I tried it on different data sets. It seems to work on those up to about 10 MB in size; anything above that, I get incorrect output, with some lines that were supposed to be suppressed still in the file. But this is certainly better than anything I've found elsewhere :) – Billy the Poet Jun 11 '18 at 23:45
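
If the problems on larger files are memory-related, a chunked variant of the Pandas approach may help. This is only a sketch: the chunk size of 100,000 is an arbitrary placeholder and the filenames are those from the question.

import re
import pandas as pd

df_filter = pd.read_csv('File2.csv')
# escape the filter strings so regex metacharacters match literally
str_filter = '|'.join(map(re.escape, df_filter['column1']))

# filter File1.csv 100,000 rows at a time to bound memory usage
first_chunk = True
for chunk in pd.read_csv('File1.csv', chunksize=100_000):
    chunk = chunk[~chunk['column1'].str.contains(str_filter)]
    # write the header only once, then append subsequent chunks
    chunk.to_csv('file_out.csv', mode='w' if first_chunk else 'a',
                 header=first_chunk, index=False)
    first_chunk = False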