1

I hope someone can help.

I'm in the situation where I need to delete rows from one CSV file based on another CSV. Consider this simple example:

Time                Some Column
4/25/2016 06:20:00  A
4/25/2016 06:20:01  B
4/25/2016 06:20:02  B 
4/25/2016 06:20:03  B
4/25/2016 06:20:04  A
4/25/2016 06:20:05  A

Then I have another file:

Time                Block
4/25/2016 06:20:00  Block B for 10 seconds

I should be able to use the second file and read it into my program such that it will delete any rows with with 'B' in 'Some Column' for 10 seconds after 6:20:00, so in effect, I need some function that will look at the first and second CSV file and produce this for me:

Time                    Some Column
4/25/2016 06:20:00      A
4/25/2016 06:20:04      A
4/25/2016 06:20:05      A   

Please note that the CSV's I am dealing with have over 3 million rows so using something slow like openpyxl is not really an option, any ideas?

MathsIsHard
  • 277
  • 3
  • 9
  • Take a look at the highest scoring answer to [this question](https://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression). – K. Nielson Jul 26 '17 at 12:38

1 Answers1

1

One way you could do this is to use pd.merge_asof to help with 10-second interval. Merge the two files together on Time with a tolarance equal to pd.Timedelta(10, unit='s') filtering file1 to only the 'B'. Drop from file1 those records that were returned from the merge_asof.

from io import StringIO
csv_file1 = StringIO("""Time                Some Column
4/25/2016 06:20:00  A
4/25/2016 06:20:01  B
4/25/2016 06:20:02  B 
4/25/2016 06:20:03  B
4/25/2016 06:20:04  A
4/25/2016 06:20:05  A""")

csv_file2 = StringIO("""Time                Block
4/25/2016 06:20:00  Block B for 10 seconds""")

df1 = pd.read_csv(csv_file1, sep='\s\s+', index_col='Time', engine='python', parse_dates=True)
df2 = pd.read_csv(csv_file2, sep='\s\s+', index_col='Time', engine='python', parse_dates=True)

df_out = (df1.drop(pd.merge_asof(df1[df1['Some Column'] == 'B'],
                                 df2, 
                                 right_index=True, 
                                 left_index=True,
                                 tolerance=pd.Timedelta(10, unit='s')).index))

print(df_out.reset_index())

Output:

                Time Some Column
0 2016-04-25 06:20:00           A
1 2016-04-25 06:20:04           A
2 2016-04-25 06:20:05           A
Scott Boston
  • 147,308
  • 15
  • 139
  • 187