1

I've got a dataframe for which I am trying to verify an event based on others values in the dataframe. To be more concrete it's about UFO sightings. I've already grouped the df by date of sighting and dropped all rows with only one unique entry. The next step would be to check when dates are equal whether the city also is.

enter image description here

In this case I would like to drop all lines, as city is different.

enter image description here

I'd like to keep, as the event has got the same time and and the city is the same.

I am looking for way to do this for my entire dataframe. Sorry if that's a stupid question I'm very new to programming.

Psidom
  • 209,562
  • 33
  • 339
  • 356
Mark Wellings
  • 71
  • 1
  • 9
  • Why not group on date and city (and possibly state to avoid conflating cities with the same name in different states)? – 3novak Dec 25 '16 at 16:08
  • @3novak : I don't think that would help. I need a df which only consists of entries which are equal in datetime column and equal in city column. Good point though to consider states as well.. – Mark Wellings Dec 25 '16 at 16:21
  • Perhaps I don't understand, but `df.groupby(['datetime', 'city'])` produces a groupby object which contains dataframes where datetime and city are identical. You could find the number of rows for each combination for datetime and city to find if reports are corroborated or not. – 3novak Dec 25 '16 at 16:38
  • I get your point but in my grouped by dataframe I dont know how many times city was grouped.. If I know it's been grouped because of two cities I would know it's a valid entry – Mark Wellings Dec 25 '16 at 17:07
  • rather than posting screen shots of your data, it's better to generate an example of the problem that bootstraps representative data so that people can copy and paste into e.g., a jupyter notebook and start messing around. http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Paul H Dec 25 '16 at 17:39

2 Answers2

0

I don't think I'm understanding your problem, but I'll post this answer and we can work from there.

The corroborations column counts the number of times we have an observation with the same datetime and city/state combination. So in the example below, the 20th of December has three sightings, but two of those were in Portville, and the other was in Duluth. Thus the corroborations column for each event receives values of 2 and 1, respectively.

Similarly, even though we have four observations taking place in Portville, there two of them happened on the 20th, and the others on the 21st. Thus we group them as two separate events.

df = pd.DataFrame({'datetime': pd.to_datetime(['2016-12-20', '2016-12-20', '2016-12-20', '2016-12-21', '2016-12-21']), 
                   'city': ['duluth', 'portville', 'portville', 'portville', 'portville'], 
                   'state': ['mn', 'ny', 'ny', 'ny', 'ny']})

s = lambda x: x.shape[0]
df['corroborations'] = df.groupby(['datetime', 'city', 'state'])['city'].transform(s)

>>> df
    datetime        city state  corroborations
0 2016-12-20      duluth    mn               1
1 2016-12-20   portville    ny               2
2 2016-12-20   portville    ny               2
3 2016-12-21   portville    ny               2
4 2016-12-21   portville    ny               2
3novak
  • 2,506
  • 1
  • 17
  • 28
0

If you are just trying to remove duplicates of the combination of datetime, city and state then you can do the following which will keep the first row with first occurrence of each datetime, city and state combination.

df[df.duplicated(subset=['datetime', 'city', 'state']) == False]
Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
  • I was actually trying to maintain the duplicated and get rid of the single entries. Setting it to true did the job, thanks! – Mark Wellings Dec 25 '16 at 17:56
  • Ok, make sure you read the documentation on `duplicated`. Look at the keep parameter to get what you want >>>keep : {'first', 'last', False}, default 'first' - ``first`` : Mark duplicates as ``True`` except for the first occurrence. - ``last`` : Mark duplicates as ``True`` except for the last occurrence. - False : Mark all duplicates as ``True``. – Ted Petrou Dec 25 '16 at 18:16