How to find elements in a Pandas DataFrame that are still present a certain time later

Question

I have a Pandas DataFrame comprising of records of tropical cyclone (TC) position, intensity, and timing. It's arranged as such:

Pertinent pieces of data here include the CY which is essentially a unique identifier number for the TC in the basin that year, and the YYYYMMDDHH column, which is the date that the observation is being entered.

I wish to determine for with dates the storm in question exists in the database 72 hours afterward. Currently I'm attempting to do something of the following sort:

deck.loc[((deck['BASIN'] == deck['BASIN']) & (deck['CY'] == deck['CY']) 
      & (len(deck['YYYYMMDDHH'] == (deck['YYYYMMDDHH'] + datetime.timedelta(hours=72)))) > 0)]

However, this isn't giving me the boolean series I desire for the indexing. I've also considered doing things row-by-row, but know that this goes against the Pandas philosophy.

is the question that you want to know which storms lasted at least 72 hours? — Chris, Jul 19 '23 at 17:44
Sorry - I think I was a little unclear in the initial post. For each time in the YYYYMMDDHH column I wish to determine if the given storm exists in the database >=72 hours later. — SlushP22, Jul 19 '23 at 17:47
[Don't post pictures of text](//meta.stackoverflow.com/q/285551/4518341). Instead, post the text itself and use the formatting tools like [code formatting](/editing-help#code). You might need to use `print(df)` to get the alignment right. — wjandrea, Jul 19 '23 at 18:00
Lastly, please make a [mre] including minimal example data and desired output. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). — wjandrea, Jul 19 '23 at 18:01
For the title, you might do something like, "How to find elements in a df that are still present a certain time later?" — wjandrea, Jul 19 '23 at 18:10

score 1 · Accepted Answer · edited Jul 19 '23 at 18:09

Since you posted data as an image, I recreated a simple example.

We're checking if the last time per storm minus the current time is greater to or equal 72 hours

df = pd.DataFrame({
    'storm': ['A','B'],
    'time': [pd.date_range(periods=75, freq='H', start='01-01-2022 00:00:00'),
             pd.date_range(periods=10, freq='H', start='01-01-2022 00:00:00')]
    }).explode('time')

(
    (df.groupby('storm')['time'].transform('max') - df['time'])
    .dt.total_seconds() / 3600
).ge(72)

ThomasNicholas · Answer 2 · 2023-07-19T18:26:13.397

1

Using the xarray package (which is basically pandas for multidimensional data) you could first represent your dataset as an xarray.Dataset object ds with a single time dimension, and assigning basin and CY as coordinates along that dimension. I think everything else would be a data variable.

Then your operation might be expressible as something like this:

duration = ds.time.groupby(['CY', 'basin']).sum(dim='time')
persistent_storms = ds.where(duration > datetime.timedelta(hours = 72))

edited Jul 19 '23 at 18:26

answered Jul 19 '23 at 17:56

ThomasNicholas

1,273
11
21

What's `ds`? I've never used xarray. – wjandrea Jul 19 '23 at 18:12
1

`ds` conventionally refers to an instance of an `xarray.Dataset` object. It's like a multidimensional `pandas.DataFrame`. – ThomasNicholas Jul 19 '23 at 18:25

How to find elements in a Pandas DataFrame that are still present a certain time later

2 Answers2