1

I'm trying to write a Python code using Pandas to filter out a CSV file, dependent on conditions set from another CSV file.

The CSV I want to filter looks something like this:

date              product
01/05/2020        Test Product 1
02/05/2020        Test Product 1
03/05/2020        Test Product 1
04/05/2020        Test Product 1
05/05/2020        Test Product 1
06/05/2020        Test Product 1
07/05/2020        Test Product 1
01/05/2020        Test Product 2
02/05/2020        Test Product 2
03/05/2020        Test Product 2
04/05/2020        Test Product 2
05/05/2020        Test Product 2
06/05/2020        Test Product 2
07/05/2020        Test Product 2

And the CSV with the condition something like this:

product               start_date
Test Product 1        01/05/2020
Test Product 2        04/05/2020

What I'm looking to do, is filter the first CSV so that any Test Product that has a start_date after 01/05, will have the relevant rows deleted. For example, Test Product 2 has start_date 04/05/2020, which means in the first CSV i'm trying to delete the rows for 01/05, 02/05 and 03/05 for that product.

This would be the desired output:

    date              product
01/05/2020        Test Product 1
02/05/2020        Test Product 1
03/05/2020        Test Product 1
04/05/2020        Test Product 1
05/05/2020        Test Product 1
06/05/2020        Test Product 1
07/05/2020        Test Product 1
04/05/2020        Test Product 2
05/05/2020        Test Product 2
06/05/2020        Test Product 2
07/05/2020        Test Product 2

What's the best way to do this? I've been trying to do it with so many ways using dataframes for a long time, but haven't reached any correct script...

Many thanks in advance!

MTavares
  • 27
  • 7

1 Answers1

1

We can use boolean indexing with Series.map

df_filtered = df1.loc[df1['date'].ge(df1['product']\
                                     .map(df2.set_index('product')['start_date']))]
print(df_filtered)
          date         product
0   01/05/2020  Test Product 1
1   02/05/2020  Test Product 1
2   03/05/2020  Test Product 1
3   04/05/2020  Test Product 1
4   05/05/2020  Test Product 1
5   06/05/2020  Test Product 1
6   07/05/2020  Test Product 1
10  04/05/2020  Test Product 2
11  05/05/2020  Test Product 2
12  06/05/2020  Test Product 2
13  07/05/2020  Test Product 2

In order to read your csv file we use pd.read_csv

ansev
  • 30,322
  • 5
  • 17
  • 31
  • Thanks for the reply! I'm not very familiar with boolean indexing or series.map. Do you know why i got the following error point at the very last square bracket while trying your code: "SyntaxError: unexpected character after line continuation character"? – MTavares May 03 '20 at 17:59
  • Nevermind, I forgot to put .map in a new line. Now, I'm getting the following error: "TypeError: '>=' not supported between instances of 'str' and 'int'". Possibly meaning I have to turn both dates into timestamps? – MTavares May 03 '20 at 18:05
  • Yes you need df1['date']=pd.to_datetime(df1['date']) and the same for 'start_date' if df2 previously – ansev May 03 '20 at 18:10
  • Maybe yo need pass to `pd.to_datetime` function `format='%d/%m/%Y'` – ansev May 03 '20 at 18:17
  • Thanks, that almost worked. It deleted 01/05 to 03/05 from Test Product 2 (which is what was desired), but also deleted 07/05 from both products, which is very strange... – MTavares May 03 '20 at 18:19
  • Have you pass the format? I think pandas read by default month/day/year, so I think you need pass the format that I wrote in the last comment – ansev May 03 '20 at 18:21
  • I think that might be it yes. Currently trying to fix this error that came up when I included the format in the datetime function: ValueError: time data '2020-05-01' does not match format '%d/%m/%Y' (match) – MTavares May 03 '20 at 18:32
  • if you have time data like '2020-05-01' then you need `format = '%Y-%d-%m'` – ansev May 04 '20 at 07:21
  • All of my dates are dd/mm/yyyy, not sure where the script is getting the yyyy-mm-dd to be honest – MTavares May 04 '20 at 14:18
  • Nevermind, I had a function in my script which was doing calculations with the dates, and its output was in a different format than the rest. It's now fixed and no errors came up, your solution worked, thank you very much!! :) – MTavares May 04 '20 at 14:31
  • Hi, sorry to bother again. I'm getting a InvalidIndexError if I use your script with duplicated products (but with different clients - which I'm trying to keep). Do you know if there's a way of altering your script to allow duplicated products? I believe the problem comes from set_index('product') – MTavares May 11 '20 at 19:34
  • the problem we need is that we need to map for each client and product to avoid duplicates. But the method is not Series.map but DataFrame.merge, could you edit the question with new dataframe of examples? although it would be better if you asked a new different question so as not to lengthen this one :) – ansev May 11 '20 at 19:43
  • Hi, I actually have that in a new question too, but thought I would try my luck here as you were very helpful last time! Here it is: https://stackoverflow.com/questions/61712420/python-pandas-invalidindexerror – MTavares May 11 '20 at 20:03