
I have a DataFrame with 950 rows. Let's pretend the columns are timestamp, quantity, event, and file; this is a good approximation of the df. I want to:

  • select all rows where event is this_event and file is this_file
  • and drop a row if it has the same timestamp as a row where file is my_file and the quantities match.

How do I do that? Really struggling. I don't know how to manage this.

EDIT:

Example data:

timestamp,  event,  quantity,   file
2018-10-17 02:01:00,    slept,  7,  base
2018-10-17 02:01:00,    slept,  7,  temp
2018-10-17 02:01:00,    slept,  9,  base
2018-10-17 02:04:00,    studied,    5,  temp
2018-10-17 02:04:00,    farted, 7,  temp
2018-10-17 02:04:00,    drank,  1,  base
2018-10-17 02:04:00,    exercised,  8,  base
2018-10-17 02:04:00,    slept,  7,  base

So, for example, I will always keep records from file base; this is a bias I want to keep, as those records can never be removed. I want to delete any record from any other file (here, temp) where the timestamp and event match a base record at the same timestamp, but only when the quantity also matches a base entry at that timestamp.

So in this example data I would expect the code to identify the 2nd entry down and remove it, because its quantity of 7 matches one of the two base entries at that timestamp.

The code would not delete anything at 02:04:00 because no rows there share both a timestamp and an event string (the events are all unique). The expected result is shown below.
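To make the rule concrete, this is the output I would expect for the example above (only the 2nd row, the temp duplicate of the base entry at 02:01:00, is removed):

timestamp,  event,  quantity,   file
2018-10-17 02:01:00,    slept,  7,  base
2018-10-17 02:01:00,    slept,  9,  base
2018-10-17 02:04:00,    studied,    5,  temp
2018-10-17 02:04:00,    farted, 7,  temp
2018-10-17 02:04:00,    drank,  1,  base
2018-10-17 02:04:00,    exercised,  8,  base
2018-10-17 02:04:00,    slept,  7,  base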

uncle-junky

1 Answer


I think there are many ways to solve it. Here is some sample code for your reference.

import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in later pandas versions
from pprint import pprint

text = """
timestamp,event,quantity,file
2018-10-17 02:01:00,slept,7,base
2018-10-17 02:01:00,slept,7,temp
2018-10-17 02:01:00,slept,9,base
2018-10-17 02:04:00,studied,5,temp
2018-10-17 02:04:00,farted,7,temp
2018-10-17 02:04:00,drank,1,base
2018-10-17 02:04:00,exercised,8,base
2018-10-17 02:04:00,slept,7,base
"""

df = pd.read_csv(StringIO(text))
print("original dataframe is")
pprint(df)

qty = 7
my_file = 'temp'
my_event = 'slept'

# Group by timestamp so each check is evaluated within a single timestamp.
grp = df.groupby("timestamp")
qty_check = grp["quantity"].transform(lambda x: x == qty)
evt_check = grp["event"].transform(lambda x: x == my_event)
file_check = grp["file"].transform(lambda x: x == my_file)  # was grp["event"], a bug

combined_check = evt_check & qty_check & file_check
print("\ndropped rows are")
pprint(df[combined_check])

# To keep everything except the matches, invert the mask.
print("\nremaining dataframe is")
pprint(df[~combined_check])

Please refer to the pandas groupby/transform section as well: http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#transformation. Note: you may find the filter functions useful too, but as far as I know filter drops whole groups rather than individual rows within a group.
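To go beyond hardcoded values, here is a rough sketch of the general rule from the question (drop any non-base row whose timestamp, event, and quantity all match some base row), written as an anti-join with pandas merge and its indicator flag. The column and file names are taken from the example; adjust them to your real data.

# Sketch: anti-join against the base rows on (timestamp, event, quantity).
# Assumes `df` is the frame built above; nothing is hardcoded to 'slept' or 7.
base_keys = (df[df["file"] == "base"]
             [["timestamp", "event", "quantity"]]
             .drop_duplicates())
merged = df.merge(base_keys, on=["timestamp", "event", "quantity"],
                  how="left", indicator=True)
# Keep base rows unconditionally; keep other rows only if they matched no base row.
keep = (merged["file"] == "base") | (merged["_merge"] == "left_only")
result = df[keep.to_numpy()]
print(result)

On the example data this keeps every row except the temp duplicate at 02:01:00, matching the expected output in the question.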

Yong Wang
  • You're only filtering out lines with 'slept' and 'base'. This does not include the logic of what I described. The code would need `import datetime` in there to compare `timestamp` values, for example. Appreciate the effort though, just isn't what I'm looking for. – uncle-junky Apr 26 '19 at 03:44
  • Yes, the original one did not catch all the logic. I have updated it already. – Yong Wang Apr 26 '19 at 06:23
  • Is it possible to update the code with `groupby()` on `timestamp` and then perform the `combined_check`? – uncle-junky Apr 27 '19 at 05:34