
I have a local dataframe that gets appended with new entries daily. Once in a while, an old entry is updated. The giveaway is that a bunch of columns match, but the timestamp is more recent.

With the goal of removing the old entry and keeping the new (updated) one, I append the new entry and then "clean" the dataframe by looping through the rows to find the old entry:

del_rows = []
df2 = df.copy()
# Compare every row against every other row (O(n^2))
for index, row in df.iterrows():
    for index2, row2 in df2.iterrows():
        # Same key, but row2 has an older timestamp: mark it for deletion
        if row["crit1"] == row2["crit1"] and row["date"] > row2["date"]:
            del_rows.append(index2)

# del_rows holds index labels (possibly repeated), so drop them directly
df = df.drop(index=list(set(del_rows)))

While functional, I'd love to know the more "pandas" way of going about this. I know that apply and NumPy vectorization are faster; however, I can't think of a function that achieves this which I could pass to apply, or a way to use vectorization given the different data types.

user129818
  • Please try to include a simple [example dataset](https://stackoverflow.com/q/20109391/1222578) that shows what your data looks like. – Marius Oct 10 '18 at 23:20

3 Answers


This can be done with a groupby on crit1, selecting the latest row per group:

df.sort_values('date').groupby('crit1').tail(1)
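A minimal sketch of how this plays out, with hypothetical toy data and assuming crit1 holds the matching key and date is already a datetime:

import pandas as pd

# Hypothetical toy data; 'crit1' is the matching key from the question
df = pd.DataFrame({'crit1': ['a', 'b', 'a'],
                   'date': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])})

# Sort by date, then keep the last (latest) row within each crit1 group
print(df.sort_values('date').groupby('crit1').tail(1))

Here the first 'a' row (2018-01-01) is dropped and the updated 'a' row (2018-01-03) survives.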
Def_Os
  • I get that certain items could be removed with loc, but how would the script know the old vs. new items without checking each item against every other item? Or are you suggesting conditioning the df before appending the new item? – user129818 Oct 10 '18 at 23:35
  • I think this will work, but the actual dataset has a good number of additional criteria, and adding a couple of criteria in the subset portion of `df[~df.duplicated(subset=['crit1'], keep='last')]` seems like an easier way to go than repeated/nested levels of `groupby` (see the sketch after these comments) – user129818 Oct 11 '18 at 14:21
  • @user129818 Makes sense. Just note that `keep='last'` keeps the last row encountered, which is not necessarily the _latest_ row in terms of date/time. – Def_Os Oct 11 '18 at 16:04
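
A minimal sketch of the multi-column variant floated in the second comment above; the extra key column crit2 is hypothetical, and rows are assumed to be appended in chronological order so the last occurrence is the newest:

# 'crit2' is a hypothetical second matching column; keep the last
# occurrence for each (crit1, crit2) combination
df = df[~df.duplicated(subset=['crit1', 'crit2'], keep='last')]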

IIUC, you can use duplicated() to create a boolean filter. For a sample dataframe:

    crit1        date
0   test1  01-01-2018
1   test2  01-02-2018
2   test3  01-03-2018
3   test4  01-04-2018
4   test5  01-05-2018
5   test6  01-06-2018
6   test3  01-07-2018
7   test7  01-08-2018
8   test8  01-09-2018
9   test2  01-10-2018
10  test9  01-11-2018

Simply do:

df[~df.duplicated(subset=['crit1'], keep='last')].reset_index(drop=True)

Yields:

   crit1        date
0  test1  01-01-2018
1  test4  01-04-2018
2  test5  01-05-2018
3  test6  01-06-2018
4  test3  01-07-2018
5  test7  01-08-2018
6  test8  01-09-2018
7  test2  01-10-2018
8  test9  01-11-2018
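
One caveat, echoing the comment under the first answer: duplicated(keep='last') keeps the last row encountered, which is only the latest entry if rows were appended in date order. A hedged variant that sorts on a parsed date first, assuming the MM-DD-YYYY string format shown above:

# Parse the MM-DD-YYYY strings so sorting is chronological, not lexicographic
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%Y')

# Sort by date, then keep the newest row per crit1
df = (df.sort_values('date')
        .drop_duplicates(subset=['crit1'], keep='last')
        .reset_index(drop=True))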
rahlf23

The new entry might have a date older than the one already in the dataframe; in that case, dropping duplicates simply by first or last occurrence would not be correct.

Another alternative is to drop the duplicate by finding the entry with the minimum date.

Below is a worked-out example.

import pandas as pd

date = pd.date_range(start='1/1/2018', end='1/5/2018')
crit = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame({'crit': crit, 'date': date})

# Insert a new entry for 'b' whose date is *older* than the existing one
df.loc[len(df)] = ['b', '1/6/2016']

# Convert date back to datetime (appending the string made the column object dtype)
df['date'] = pd.to_datetime(df['date'])

print(df, '\n')

# Find the rows duplicated in crit and the oldest date among them
print(df[df.duplicated('crit', keep=False)]['date'].min(), '\n')
print(df['date'] != df[df.duplicated('crit', keep=False)]['date'].min())

# Apply the filter: keep every row except the one holding that oldest date
print(df[df['date'] != df[df.duplicated('crit', keep=False)]['date'].min()])
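
For reference (the answer originally included a screenshot here), the final print should show the dataframe with the stale 'b' entry removed, something like:

  crit       date
0    a 2018-01-01
1    b 2018-01-02
2    c 2018-01-03
3    d 2018-01-04
4    e 2018-01-05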

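One caveat worth noting: the min() above is taken across all duplicated rows at once, so with several different duplicated keys it could drop the wrong rows. A hedged sketch of a per-key variant using groupby().transform('min'):

# Within each duplicated 'crit' group, drop the row holding that group's oldest date
is_dup = df.duplicated('crit', keep=False)
oldest = df.groupby('crit')['date'].transform('min')
df = df[~(is_dup & (df['date'] == oldest))]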

Khalil Al Hooti