
I am currently using rclone to access AWS S3 data, and since I don't use either one much, I am not an expert.

I am accessing the public bucket unidata-nexrad-level2-chunks, and there are 1000 folders I am looking at. To see these, I am using the Windows command prompt and entering:

rclone lsf chunks:unidata-nexrad-level2-chunks/KEWX

Only one folder has real-time data being written to it at any time, and that is the one I need to find. How do I determine which one I need? I could run a check to see which folder has the newest data, but how can I do that?

The output from my command looks like this:

1/
10/
11/
12/
13/
14/
15/
16/
17/
18/
19/
2/
20/
21/
22/
23/
... ... ... (to 1000)

What can I do to find where the latest data is being written? Since it is only one folder at a time, I hope it will be simple.

Edit: I realized I need a way to list the latest file (along with its folder #) without listing every single file and timestamp in all 999 directories. I am starting a bounty, and the correct answer that allows me to do this without slogging through all of them will be awarded the bounty. If it takes 20 minutes to list the contents of all 999 folders, it's useless, as the next folder will be active by that time.

David
  • Can you explain how you define a folder as "the latest data is being written to"? Does it change every day/hour, so that it is unknown? – Marcin Jul 10 '21 at 11:05
  • Yes, every 5-9 minutes the incoming data chooses a new folder to write to. – David Jul 18 '21 at 07:09

1 Answer


If you want to know the specific folder with the very latest file, you can write your own script that retrieves a list of ALL objects, then figures out which one is the latest and which folder it is in. Here's a Python script that does it:

import boto3

s3_resource = boto3.resource('s3')

# List every object under the KEWX/ prefix (LastModified comes back with the
# listing, so no per-object HEAD requests are needed)
objects = s3_resource.Bucket('unidata-nexrad-level2-chunks').objects.filter(Prefix='KEWX/')

# Pair each object's timestamp with its key so sorting by date keeps the key attached
date_key_list = [(obj.last_modified, obj.key) for obj in objects]

print(len(date_key_list))  # How many objects?

# Newest first; the key of the newest object shows which folder is currently active
date_key_list.sort(reverse=True)

print(date_key_list[0][1])

Output:

43727
KEWX/125/20200912-071306-065-I

It takes a while to go through those 43,700 objects!
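If you don't have AWS credentials configured locally, the same listing can be done anonymously, since the bucket is public. A minimal sketch using botocore's UNSIGNED config:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned (anonymous) requests are enough for a public bucket; no credentials needed
s3_resource = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
objects = s3_resource.Bucket('unidata-nexrad-level2-chunks').objects.filter(Prefix='KEWX/')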

John Rotenstein
  • Hi John, I have accepted your answer because it was very helpful. However you are correct when saying it takes a while to list the objects. I am trying to figure out a quick way to do this since this is live weather radar data. This data is part of the AWS/NOAA partnership and somehow there must be a way to find out which directory is the "working" one! I can't see other people using the data without somehow knowing. – David Sep 12 '20 at 17:28
  • It looks like something called `RadarServer` might help you identify the files to use: [Using Python to Access NCEI Archived NEXRAD Level 2 Data](https://nbviewer.jupyter.org/gist/dopplershift/356f2e14832e9b676207) – John Rotenstein Sep 12 '20 at 22:12
  • I appreciate the help; however, that is for accessing completed volume scans. I still have not figured out how to find which folder is currently the one to use, even after much time searching. It seems this is a very important piece of information to have no documentation on; very confusing. – David Sep 14 '20 at 02:50
  • Isn't it possible to use the --max-age (https://rclone.org/filtering/#max-age-don-t-transfer-any-file-older-than-this) and --dry-run / -v parameters to achieve what you want? If you set --max-age to, say, the update interval of the file, or 1d, and you want things to be exact, you could use this technique to filter out most of the data and save time in the process (and then use the above (accepted) script to get the exact directory). This hopefully avoids making so many HEAD requests. – jcuypers Jul 14 '21 at 08:22
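A hedged sketch of the filtering idea from the comment above, assuming a 10-minute window (adjust to taste) and adding --use-server-modtime so rclone takes the modification time from the S3 listing rather than issuing a HEAD request per object:

rclone lsf -R --files-only --max-age 10m --use-server-modtime --format "tp" chunks:unidata-nexrad-level2-chunks/KEWX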