-1

I have a Pandas DataFrame comprising of records of tropical cyclone (TC) position, intensity, and timing. It's arranged as such:

Pertinent pieces of data here include the CY which is essentially a unique identifier number for the TC in the basin that year, and the YYYYMMDDHH column, which is the date that the observation is being entered.

I wish to determine for with dates the storm in question exists in the database 72 hours afterward. Currently I'm attempting to do something of the following sort:

deck.loc[((deck['BASIN'] == deck['BASIN']) & (deck['CY'] == deck['CY']) 
      & (len(deck['YYYYMMDDHH'] == (deck['YYYYMMDDHH'] + datetime.timedelta(hours=72)))) > 0)]

However, this isn't giving me the boolean series I desire for the indexing. I've also considered doing things row-by-row, but know that this goes against the Pandas philosophy.

SlushP22
  • 17
  • 6
  • 1
    is the question that you want to know which storms lasted at least 72 hours? – Chris Jul 19 '23 at 17:44
  • 1
    Sorry - I think I was a little unclear in the initial post. For each time in the YYYYMMDDHH column I wish to determine if the given storm exists in the database >=72 hours later. – SlushP22 Jul 19 '23 at 17:47
  • Please read [ask] for tips like how to write a good title. – wjandrea Jul 19 '23 at 17:58
  • [Don't post pictures of text](//meta.stackoverflow.com/q/285551/4518341). Instead, post the text itself and use the formatting tools like [code formatting](/editing-help#code). You might need to use `print(df)` to get the alignment right. – wjandrea Jul 19 '23 at 18:00
  • Lastly, please make a [mre] including minimal example data and desired output. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). – wjandrea Jul 19 '23 at 18:01
  • For the title, you might do something like, "How to find elements in a df that are still present a certain time later?" – wjandrea Jul 19 '23 at 18:10

2 Answers2

1

Since you posted data as an image, I recreated a simple example.

We're checking if the last time per storm minus the current time is greater to or equal 72 hours

df = pd.DataFrame({
    'storm': ['A','B'],
    'time': [pd.date_range(periods=75, freq='H', start='01-01-2022 00:00:00'),
             pd.date_range(periods=10, freq='H', start='01-01-2022 00:00:00')]
    }).explode('time')

(
    (df.groupby('storm')['time'].transform('max') - df['time'])
    .dt.total_seconds() / 3600
).ge(72)
wjandrea
  • 28,235
  • 9
  • 60
  • 81
Chris
  • 15,819
  • 3
  • 24
  • 37
1

Using the xarray package (which is basically pandas for multidimensional data) you could first represent your dataset as an xarray.Dataset object ds with a single time dimension, and assigning basin and CY as coordinates along that dimension. I think everything else would be a data variable.

Then your operation might be expressible as something like this:

duration = ds.time.groupby(['CY', 'basin']).sum(dim='time')
persistent_storms = ds.where(duration > datetime.timedelta(hours = 72))
ThomasNicholas
  • 1,273
  • 11
  • 21