
I'm trying to change the first observation for each unique ID in a dataframe to NaN. I'm working with a dataframe of timestamps and coordinate points that is already sorted by unique ID and timestamp.

Sample:

    ID        timestamp   latitude  longitude
0    1   6/9/2017 11:20  38.795333  77.008883
1    1   6/9/2017 13:10  38.889011  77.050061
2    1   6/9/2017 16:23  40.748249  73.984191
3    2  6/11/2017 08:35  38.920602  77.222329
4    2  6/11/2017 10:00  42.366211  71.020943
5    2  6/11/2017 20:00  38.897416  77.036833
6    2  6/12/2017 07:30  38.851426  77.042298
7    2  6/12/2017 10:20  38.917346  77.222553
8    3  6/11/2017 09:01  40.782869  73.967544
9    3  6/11/2017 10:03  38.954268  77.449695
10   3  6/11/2017 11:48  38.872875  77.007763
11   3  6/12/2017 11:52  40.776931  73.876155

Attempt:

df['latitude'] =\
 df.groupby('ID')['latitude'].apply(lambda x: x[0].np.nan)
df['longitude'] =\
 df.groupby('ID')['longitude'].apply(lambda x: x[0].np.nan)

I suspected that although the df was already sorted, I would still need to use a groupby to operate on each unique ID. I'm having trouble working out how to access each group's first value and then replace it with NaN.

The attempt above gives the error: KeyError: 0

Here's the desired output:

        ID        timestamp   latitude  longitude
    0    1   6/9/2017 11:20        NaN       NaN
    1    1   6/9/2017 13:10  38.889011  77.050061
    2    1   6/9/2017 16:23  40.748249  73.984191
    3    2  6/11/2017 08:35        NaN       NaN
    4    2  6/11/2017 10:00  42.366211  71.020943
    5    2  6/11/2017 20:00  38.897416  77.036833
    6    2  6/12/2017 07:30  38.851426  77.042298
    7    2  6/12/2017 10:20  38.917346  77.222553
    8    3  6/11/2017 09:01        NaN       NaN
    9    3  6/11/2017 10:03  38.954268  77.449695
    10   3  6/11/2017 11:48  38.872875  77.007763
    11   3  6/12/2017 11:52  40.776931  73.876155

Edit (Why do this?):

I'm trying to adapt a version of this answer to calculate distance and velocity. Everything works great except that the first lat/lon values for each ID are wrong, because the function calculates across rows indiscriminately of the ID. Looking at a different solution, I suspect I need to do something similar... calculating both the velocity and distance using a concat and shift. That's kind of difficult for me to conceptualize, though - so I figured just replacing those values would be simpler than editing and rerunning - which is why I posed the question.
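(For what it's worth, the shift itself can be made group-aware, so the first row of each ID never inherits the previous ID's coordinates. A minimal sketch with made-up values, assuming the column names from the sample above:)

```python
import pandas as pd

# Toy frame with the same column names as the sample above.
df = pd.DataFrame({
    'ID':       [1, 1, 2, 2],
    'latitude': [38.79, 38.88, 38.92, 42.36],
})

# shift() within each ID group: the first row of every group gets NaN
# instead of inheriting the previous ID's last coordinate.
df['prev_lat'] = df.groupby('ID')['latitude'].shift()
```

`df.groupby('ID')['latitude'].diff()` behaves the same way if you only need per-row deltas.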

champezius
    Why do you want to do this? – ifly6 Aug 09 '19 at 21:15
  • Would you mind providing code to construct your dataframe? – Moormanly Aug 09 '19 at 21:23
    @Moormanly if you make variable `s` as a copy of the text at the top, you can then parse the whole data frame by invoking `pd.read_csv(io.StringIO(s), sep=' +', engine='python')` (n.b. `sep` is two spaces, then a plus). – ifly6 Aug 09 '19 at 21:28
    Agree with the comments above. It's possible to achieve this by having a mask and then use `where` to generate `NaN`, but I have no clue what's the motivation behind this. – Mark Wang Aug 09 '19 at 21:36
  • I've edited to provide some more context. Thanks to all for your help. – champezius Aug 09 '19 at 21:48
  • That's a reasonable reason to do this. But, to clarify, do you want the velocity between each time stamp or the average velocity over the whole? – ifly6 Aug 09 '19 at 21:51
  • I want the former. What I've done is `df['timeDiff'] = df['timestamp'].diff()` and then adapted the velocity function slightly from the first answer (within the edit)... adding the difference to the original timestamp (thus getting an end time). Again, this works for me overall - except that the first entries for each ID are incorrect, because those are originating points, calculating on previous coordinates from a different ID. – champezius Aug 09 '19 at 22:08

2 Answers


Since your df is already sorted by the ID column, you can use the following trick to get the first occurrence of each unique ID as a boolean mask:

mask = df.ID != df.ID.shift()

Then set the corresponding data to NaN:

df.loc[mask, ['latitude', 'longitude']] = np.nan
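
Put together with a few rows of the sample data, the whole thing might look like this (values copied from the question, not the full frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':        [1, 1, 2, 2, 2],
    'latitude':  [38.795333, 38.889011, 38.920602, 42.366211, 38.897416],
    'longitude': [77.008883, 77.050061, 77.222329, 71.020943, 77.036833],
})

# True wherever the ID differs from the row above, i.e. at the first
# row of each consecutive run of IDs (row 0 included, since NaN != 1).
mask = df.ID != df.ID.shift()

df.loc[mask, ['latitude', 'longitude']] = np.nan
```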
Moormanly
    This is heavily dependent on the order of the rows. Oh whoops, your edit beat me to it. – ifly6 Aug 09 '19 at 21:43
  • I will elaborate to mention that. OP mentioned their IDs are sorted though, so it should be okay for them at least. – Moormanly Aug 09 '19 at 21:44

Edit:

Stealing a bit from Moormanly, you can turn it into a one-liner:

df.loc[df.groupby('ID').head(1).index,
       ['longitude', 'latitude']] = float('nan')

This should do the trick:

indices = df.groupby('ID').head(1).index
df.loc[indices, 'latitude'] = float('nan')
df.loc[indices, 'longitude'] = float('nan')

Result:

       ID        timestamp   latitude  longitude
index                                           
0       1   6/9/2017 11:20        NaN        NaN
1       1   6/9/2017 13:10  38.889011  77.050061
2       1   6/9/2017 16:23  40.748249  73.984191
3       2  6/11/2017 08:35        NaN        NaN
4       2  6/11/2017 10:00  42.366211  71.020943
5       2  6/11/2017 20:00  38.897416  77.036833
6       2  6/12/2017 07:30  38.851426  77.042298
7       2  6/12/2017 10:20  38.917346  77.222553
8       3  6/11/2017 09:01        NaN        NaN
9       3  6/11/2017 10:03  38.954268  77.449695
10      3  6/11/2017 11:48  38.872875  77.007763
11      3  6/12/2017 11:52  40.776931  73.876155
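
In case it helps to see why `head(1).index` picks out the right rows: `head(1)` keeps the first row of each group, and `.index` recovers those rows' labels in the original frame. On data like the OP's, the same labels also fall out of `drop_duplicates`; a minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3]})

# Index labels of each group's first row.
first_rows = df.groupby('ID').head(1).index

# Equivalent here: the first occurrence of each distinct ID value.
also_first = df['ID'].drop_duplicates().index
```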

sobek