How can I quickly filter a large dataset?

Question

I have a large data set of global latitudes and longitudes, but I am only interested in one specific region, so I want to filter out every lat/lon pair that falls outside it. I am currently using if statements to parse the data, but this takes too long. Is there a faster way to accomplish this?

The data comes from a netCDF file and can be stored in a dictionary. I only want latitudes between 10 and 80 degrees North, and longitudes between -170 and -50 degrees. Here is what I have tried so far:

ret_dict = {}
with Dataset(filename,'r') as fid:
    ret_dict['time'] = fid.variables['timeObs'][:]
    sort_order = np.argsort(ret_dict['time'])
    lat1 = [i for i in fid.variables['latitude'][:][sort_order] if fid.variables['latitude'][:][sort_order] > 10 ]
    lat2 = [i for i in lat1 if lat1 < 80]

The above code can be repeated for the longitudes. However, it is too slow for the amount of data I have. It also doesn't give me the indices, which I need to make sure I keep the original latitude and longitude pairs together. How can I quickly truncate the data for all variables?

EDIT: The answer below is correct for the first part of the question. However, I am also trying to truncate the other variables using the indices of the filtered latitudes. I am trying:

lon = [j for i,(j,i) in zip(fid.variables['longitude'][:],fid.variables['longitude']) if 10<i<80]

However I am getting the error: ***TypeError: 'numpy.float32' object is not iterable

This has been answered before: http://stackoverflow.com/questions/29135885/netcdf4-extract-for-subset-of-lat-lon/35320631#35320631 — N1B4, Jun 15 '16 at 16:56

Brian · Accepted Answer · 2016-06-15T15:23:38.310

Is there any reason you need your data sorted? Sorting is an expensive operation, O(n log n), whereas filtering is just O(n). If sorting isn't a requirement, you can do it with a single filtering pass. (Keep in mind that I don't know your data, so you may need to modify this a bit.)

lat = [i for i in fid.variables['latitude'][:] if 10 < i < 80 ]

This is the fastest way I can figure with the limited amount of information given. If this is still too long, give more information so that we can try to help you even more :)
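If the list comprehension is still too slow, one common alternative is a NumPy boolean mask, which does the comparison on the whole array at once and can be reused to truncate every other variable. The following is only a sketch and not something taken from your file: the small arrays are hypothetical stand-ins for fid.variables['latitude'][:], fid.variables['longitude'][:] and fid.variables['timeObs'][:], and it assumes all three are 1-D and aligned by index.

```python
import numpy as np

# Hypothetical stand-ins for the arrays read from the netCDF file.
# Assumption: element k of each array belongs to the same observation.
lat = np.array([5.0, 25.0, 45.0, 85.0, 60.0], dtype=np.float32)
lon = np.array([-100.0, -120.0, 10.0, -90.0, -60.0], dtype=np.float32)
time = np.array([3, 1, 4, 2, 5])

# Each comparison is evaluated elementwise and yields a boolean array;
# "&" combines them (the parentheses are required).
mask = (lat > 10) & (lat < 80) & (lon > -170) & (lon < -50)

# Integer positions of the observations that fall inside the region.
keep = np.flatnonzero(mask)

# Applying the same mask to every variable keeps the pairs together.
lat_f = lat[mask]
lon_f = lon[mask]
time_f = time[mask]

print(keep)    # indices of the retained observations
print(lat_f, lon_f, time_f)
```

Because the same mask is applied to every array, no sorting is needed to keep the lat/lon pairs matched, and keep gives you the original indices if you want them.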

edited Jun 15 '16 at 15:23

answered Jun 15 '16 at 15:17

I am getting the error: ****ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). I am sorting the data so that the lat/lon pairs remain together. Is this not necessary? – manateejoe Jun 15 '16 at 15:21
I made a sloppy mistake and edited to fix it. Sorry. Also, you most likely do not need to sort so that the pairs remain together. You can verify this on your own. – Brian Jun 15 '16 at 15:24
I made an edit to further clarify the second part of my question – manateejoe Jun 15 '16 at 16:01
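On the TypeError from the edit to the question: zip yields one 2-tuple per step, so the target i,(j,i) tries to unpack the second element (a single numpy.float32) into (j,i), which is what raises "object is not iterable". The line also passes the longitude variable to zip twice, so latitude is never tested. A minimal sketch of the pairing, using hypothetical plain lists in place of the netCDF variables (the same loop works on 1-D arrays):

```python
# Hypothetical sample values standing in for the latitude and
# longitude variables; index k in both lists is the same observation.
lats = [5.0, 25.0, 45.0, 85.0]
lons = [-100.0, -120.0, -60.0, -90.0]

# zip gives one (latitude, longitude) tuple per observation, so two
# flat names are enough to unpack it.
lon_kept = [lo for la, lo in zip(lats, lons) if 10 < la < 80]

# enumerate gives the original indices of the latitudes that pass,
# which can then be used to pick matching values from other variables.
idx_kept = [k for k, la in enumerate(lats) if 10 < la < 80]

print(lon_kept)
print(idx_kept)
```

This keeps each longitude whose paired latitude is in range; idx_kept can be used the same way for any other variable of the same length.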
