
I need to process weather data from this website (https://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.20190814/06/); each file is around 300 MB. Once I download a file, I only need to read in a subset of it. I think that downloading it is going to be too slow, so I was going to use BeautifulSoup to read in the data directly from the website, like this:

from bs4 import BeautifulSoup
import requests

url = 'https://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.20190814/06/gfs.t06z.pgrb2.0p25.f000'
response = requests.get(url)
soup = BeautifulSoup(response.content, features='lxml')

And then use the pygrib library to read in a subset of the resulting .grib file (a weather data format). However, this also proves to be too slow, taking approximately 5 minutes for something that will need to be done 50 times a day. Is there some faster alternative I am not thinking of?

Preethi Vaidyanathan
  • What data are you trying to get? If the issue is downloading the file, you'll have to narrow down your search in the NOAA API. If you can't do that you're probably stuck with downloading the large file, or scraping the data in a completely different way. – Mason Caiby Aug 14 '19 at 18:35
  • Yep, the info is a subset of the file, which I could get by using a pygrib command to parse that subset, but only full files are available – Preethi Vaidyanathan Aug 14 '19 at 18:36
  • Does the data live somewhere on the NOAA site? Could you scrape it directly from the site, not using the full file? E.g. navigate to the weather in San Francisco and find the temperature they're displaying on the page. This way you don't have to download the file. Otherwise, you can refer to this question and possibly grab the specific bytes of the file you want, assuming the data location doesn't change: https://stackoverflow.com/questions/1798879/download-file-using-partial-download-http – Mason Caiby Aug 14 '19 at 18:39

1 Answer


What you can do is download the matching .idx file, which gives you the byte offsets and sizes of each record within the main file. You can then identify the parts of the file that you need and use the techniques mentioned in the accepted answer to Only download a part of the document using python requests to fetch just those byte ranges.
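As a sketch of that idea: each line of a NOAA .idx file has the form `record_number:byte_offset:date:variable:level:forecast:`, so the end of one record is the start offset of the next. The helper below parses that format and issues an HTTP Range request for a single record; the URLs are the ones from the question, and the exact set of fields you filter on is up to you.

```python
import requests

# .idx sidecar file for the GRIB file in the question
IDX_URL = ('https://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/'
           'gfs.20190814/06/gfs.t06z.pgrb2.0p25.f000.idx')


def parse_idx(idx_text):
    """Parse a GRIB .idx file into (description, start, end) tuples.

    Each line looks like:
        1:0:d=2019081406:PRMSL:mean sea level:anl:
    i.e. record number, byte offset, then colon-separated metadata.
    A record's end offset is the next record's start offset minus one;
    the last record's end is None, meaning "to end of file".
    """
    records = []
    lines = [ln for ln in idx_text.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        parts = line.split(':')
        start = int(parts[1])
        desc = ':'.join(parts[3:])  # variable, level, forecast step
        end = int(lines[i + 1].split(':')[1]) - 1 if i + 1 < len(lines) else None
        records.append((desc, start, end))
    return records


def fetch_record(data_url, start, end):
    """Fetch one GRIB message via an HTTP Range request."""
    byte_range = f'bytes={start}-' if end is None else f'bytes={start}-{end}'
    resp = requests.get(data_url, headers={'Range': byte_range})
    resp.raise_for_status()
    return resp.content
```

Typical use would be `parse_idx(requests.get(IDX_URL).text)`, keep only the tuples whose description mentions the variables you care about, and call `fetch_record` for each, downloading a few MB instead of 300 MB.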

You may need to do some additional processing to be able to read the result with pygrib; the simplest option may be to download the file header plus the parts that you are interested in, and combine them into a single file, padding the ranges you skipped.

BTW, you don't need the BeautifulSoup processing at all! The content attribute of the requests.get response is already the binary data that you are after.

Additional Information:

From the comments:

For anyone who comes across this in the future, for grib files, here is a working outline of this concept that I found: https://gist.github.com/blaylockbk/39d2c5244b988706d9c51dd9fd514650#file-download_hrrr_variable_from_pando-py – P.V.

Steve Barnes
  • For anyone who comes across this in the future, for grib files, here is a working outline of this concept that I found: https://gist.github.com/blaylockbk/39d2c5244b988706d9c51dd9fd514650#file-download_hrrr_variable_from_pando-py – Preethi Vaidyanathan Aug 15 '19 at 20:51
  • @P.V. Thanks, added to the answer so as to preserve it. – Steve Barnes Aug 16 '19 at 10:36