
I have the following dataframe, df:

                     crs         Band1 level
lat       lon                               
34.595694 32.929028  b''  4.000000e+00  1000
          32.937361  b''  1.200000e+01  950
          32.945694  b''  2.900000e+01  925
34.604028 32.929028  b''  7.000000e+00  1000
          32.937361  b''  1.300000e+01  950
                 ...           ...   ...
71.179028 25.679028  b''  6.000000e+01  750
71.187361 25.662361  b''  1.000000e+00  725
          25.670694  b''  6.000000e+01  1000
          25.679028  b''  4.000000e+01  800
71.529028 19.387361  b''  1.843913e-38  1000

[17671817 rows x 3 columns]

and two arrays:

lon1=np.arange(-11,47,0.25)
lat1=np.arange(71.5,34.5,-0.25)

These two arrays (lat1, lon1) define coordinate pairs spaced 0.25 deg apart.

Dataframe df contains points (lat, lon) which are much more densely spaced than the points defined by the lon1 and lat1 arrays. What I want to do is:

  1. find (filter) all points from df within 0.125 deg of each point defined by lat1, lon1
  2. get the max and min value of level from each such sub-dataframe and store them in separate arrays of the same size as lon1 and lat1.

What I have done so far is filter the dataframe:

for x1 in lon1:
    for y1 in lat1:
        df3=df[(df.index.get_level_values('lon')>x1-0.125) & (df.index.get_level_values('lon')<x1+0.125)]
        df3=df3[(df3.index.get_level_values('lat')>y1-0.125) & (df3.index.get_level_values('lat')<y1+0.125)]

But this has very slow performance. I believe there is a faster way. I have also tagged scikit-learn since this can probably be done with it, but I lack experience with that package. Any help is appreciated.

user2727167
  • What is the relation between lon1 and lat1? Is it a Cartesian product (every value from lat1 with all values of lon1) or matching indices (lon1[0] with lat1[0], the same with index 1, etc.)? – Roim Sep 26 '20 at 08:36
  • lon1 and lat1 are the longitude and latitude of points. lon1[0], lat1[0] describe one point with longitude and latitude. – user2727167 Sep 26 '20 at 08:42
  • Maybe [this](https://stackoverflow.com/questions/26783719/efficiently-get-indices-of-histogram-bins-in-python) helps with the speed aspect; apparently sorting the data should help already – scleronomic Sep 26 '20 at 08:46
  • And one more question: do you want one sub-dataframe, or multiple sub-dataframes (as many as the size of the lon1 array), for each finding a max and min? I mean, do you want the max and min of all points within range, or the max and min of the df for each pair? – Roim Sep 26 '20 at 08:52
  • What I want is one min value and one max value from df within the 0.125 deg space around (lon1, lat1) for each sub-dataframe, i.e. as output I need two more arrays (with max and min values) of the same size as lon1. – user2727167 Sep 26 '20 at 09:02
  • link for snapshot of df1 (30k records): https://1drv.ms/u/s!Aj5DfuTWg1OMjtskbyjp5bR-Ga6tbg?e=Zfq5oZ – user2727167 Sep 26 '20 at 09:20
  • @user2727167: Thanks for posting the 30k rows of sample data. Using my solution on those data takes 33 ms on a 5 year old laptop (2.2 GHz Core i7). Is that fast enough? – John Zwinck Sep 26 '20 at 11:14
  • this has great performance. Thanks! – user2727167 Sep 26 '20 at 12:10

3 Answers


Before we start, let's convert your arrays from bin centers to bin edges:

lon1=np.arange(-11.125,47.125,0.25)
lat1=np.arange(71.625,34.125,-0.25)

Assign latitude and longitude bins for every row (note the reversed order of lat1, since pd.cut() expects monotonically increasing bin edges):

df['latcat'] = pd.cut(df.index.get_level_values(0), lat1[::-1])
df['loncat'] = pd.cut(df.index.get_level_values(1), lon1)

For your example data we now have:

                     crs         Band1  level            latcat            loncat
lat       lon                                                                    
34.595694 32.929028  b''  4.000000e+00   1000  (34.375, 34.625]  (32.875, 33.125]
          32.937361  b''  1.200000e+01    950  (34.375, 34.625]  (32.875, 33.125]
          32.945694  b''  2.900000e+01    925  (34.375, 34.625]  (32.875, 33.125]
34.604028 32.929028  b''  7.000000e+00   1000  (34.375, 34.625]  (32.875, 33.125]
          32.937361  b''  1.300000e+01    950  (34.375, 34.625]  (32.875, 33.125]
71.179028 25.679028  b''  6.000000e+01    750  (71.125, 71.375]  (25.625, 25.875]
71.187361 25.662361  b''  1.000000e+00    725  (71.125, 71.375]  (25.625, 25.875]
          25.670694  b''  6.000000e+01   1000  (71.125, 71.375]  (25.625, 25.875]
          25.679028  b''  4.000000e+01    800  (71.125, 71.375]  (25.625, 25.875]
71.529028 19.387361  b''  1.843913e-38   1000  (71.375, 71.625]  (19.375, 19.625]

Now use groupby to get the min and max level in each region:

res = df.groupby([df.latcat.cat.codes, df.loncat.cat.codes])['level'].agg(['min', 'max'])

Which gives you:

          min   max
0   176   925  1000
147 147   725  1000
148 122  1000  1000

The first level of the index is the position in the reversed lat1 array, with -1 meaning "out of range", which applies to some of your example data. The second level is the position in the lon1 array.

To convert to matrices as requested:

minlevel = np.full((len(lat1), len(lon1)), np.nan)
maxlevel = np.full((len(lat1), len(lon1)), np.nan)
x = len(lat1) - res.index.get_level_values(0) - 1 # reverse to original order
y = res.index.get_level_values(1)
minlevel[x, y] = res['min']
maxlevel[x, y] = res['max']
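
If some rows fall outside the bin edges (their category code is -1, the "out of range" case mentioned above), the fancy indexing above can point past the edge of the matrices. A minimal guard, as a sketch and not part of the original solution, would be to drop those rows from res before the assignment:

# drop bins coded -1 (out of range) before scattering into the matrices
in_range = (res.index.get_level_values(0) >= 0) & (res.index.get_level_values(1) >= 0)
res = res[in_range]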
John Zwinck

First let's review your solution: for each value in lon1 and for each value of lat1 (n^2 iterations if each array has length n), you filter the dataframe, which means scanning the entire df each time: your code runs through the dataframe n^2 times, which is inefficient.

My solution scans the dataframe only once, doing up to n checks per row. It uses the pandas apply function, which is not very efficient, but I could not find a way to do it without apply. I would love to hear a solution that filters without using apply.

I used a small reproducible example; you may need to adapt the indices to match your code. I believe this example is much easier to follow.

import pandas as pd
import numpy as np

df = pd.DataFrame({"lat":[22.5, 10.76, 7.341, 22.5], "log":[3.64, 7.234, 135, 3.644], "level":[2, 8, 19, 9]})

lat1 = np.array([22.51, 7.33])
lon1 = np.array([3.6, 135.02])

The next lines create a list of tuples, each consisting of two pandas.Interval objects. Each tuple represents (lat1[i]±x, lon1[i]±x). Note that I did not have to use pandas.Interval; I could just have built another tuple of (lat1[i]-x, lat1[i]+x), but I decided to go with pandas Interval, it does not really matter. The result: for each pair of [lat1, lon1], we have a tuple of two pandas Intervals, each spanning ±0.125.

interval_list = []
const_add = 0.125
for i, item in enumerate(lat1):
    interval_list.append((pd.Interval(left=lat1[i]-const_add, right=lat1[i]+const_add),pd.Interval(left=lon1[i]-const_add, right=lon1[i]+const_add)))

Now we want to filter the dataframe. For use with apply, I created a custom function: it checks whether the current row is within one of the tuples, and if so it returns the index into the lat1 array (you'll see later why this is useful).

def within_range(row, interval_list):
    for i, item in enumerate(interval_list):
        if row[0] in item[0] and row[1] in item[1]:
            return i
    return np.nan

df["point"] = df.apply(lambda x: within_range(x, interval_list), axis=1)

At this point in the code, we have a column named 'point'. Its value is as follows: if the row is close to point i (where i is the index into lat1 and lon1), the value is i. If it has no close point, the value is nan.

Now all that is left is to find the maximum and minimum for each point, which can be achieved easily using groupby:

max_series = df.groupby(by="point")["level"].max()
min_series = df.groupby(by="point")["level"].min()

You have two series whose index is the same as the index into lat1 and lon1. You can convert them easily to arrays using Series.array. It's worth mentioning that you did not say how to handle missing values: if no point in df is close to point (lat1[50], lon1[50]), what is the value in the maximum and minimum array? That's why I leave them as series; I believe it's easier to manipulate them before converting to arrays.
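
For instance, assuming missing grid points should simply become NaN, the two series could be aligned with lat1/lon1 and turned into arrays like this (a sketch, not part of the original code; the 'point' labels are floats because the column also holds NaN):

import numpy as np

# one slot per (lat1[i], lon1[i]) pair; grid points with no nearby rows stay NaN
slots = np.arange(len(lat1), dtype=float)
max_arr = max_series.reindex(slots).to_numpy()
min_arr = min_series.reindex(slots).to_numpy()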

The whole code together:

import pandas as pd
import numpy as np

df = pd.DataFrame({"lat":[22.5, 10.76, 7.341, 22.5], "log":[3.64, 7.234, 135, 3.644], "level":[2, 8, 19, 9]})

lat1 = np.array([22.51, 7.33])
lon1 = np.array([3.6, 135.02])

interval_list = []
const_add = 0.125

for i, item in enumerate(lat1):
    interval_list.append((pd.Interval(left=lat1[i]-const_add, right=lat1[i]+const_add),pd.Interval(left=lon1[i]-const_add, right=lon1[i]+const_add)))

def within_range(row, interval_list):
    for i, item in enumerate(interval_list):
        if row[0] in item[0] and row[1] in item[1]:
            return i
    return np.nan

df["point"] = df.apply(lambda x: within_range(x, interval_list), axis=1)
max_arr = df.groupby(by="point")["level"].max()
min_arr = df.groupby(by="point")["level"].min()
# or:
# max_arr = df.groupby(by="point")["level"].max().array
Roim

I used the trick described in this answer to efficiently get the indices corresponding to each bin in 1D and then loop over the groups for lon and lat to get the intersection of both. I use numpy here and don't directly apply min / max but focus on the indices.

import numpy as np
from scipy.sparse import csr_matrix

def digitize_group(x, bins):
    idx_x = np.digitize(x, bins)
    n, m = len(x), len(bins) + 1
    s = csr_matrix((np.arange(n), [idx_x, np.arange(n)]), shape=(m, n))
    return [group for group in np.split(s.data, s.indptr[1:-1])]

# Create dummy data
n = 100000  # 17671817
step = 0.25  # Note the shift by step/2 to transform your arrays to bins
bins_lon = np.arange(-11-step/2, 47+step/2, step) 
bins_lat = np.arange(71.5+step/2, 34.5-step/2, -step)
lon = np.random.uniform(low=bins_lon.min(), high=bins_lon.max(), size=n)
lat = np.random.uniform(low=bins_lat.min(), high=bins_lat.max(), size=n)

# Get the 1D groups
group_lon = digitize_group(lon, bins_lon)
group_lat = digitize_group(lat, bins_lat)

# Combine to 2D groups
group_lonlat = np.zeros((len(group_lon), len(group_lat)), dtype=object)
for i, lo in enumerate(group_lon):
    for j, la in enumerate(group_lat):
        group_lonlat[i, j] = np.intersect1d(lo, la, assume_unique=True)

print(group_lonlat[13, 17])
# array([   15606,   131039,   168479,   171734,   174281,   266717,   ....

By accessing group_lonlat[i, j] you get an array of indices K where each element k fulfils:

bins_lon[i] < lon[k] < bins_lon[i+1] & bins_lat[j] < lat[k] < bins_lat[j+1]

With those indices you can then access your dataframe and perform all further computations.
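
For example, assuming a level array aligned with the lon/lat arrays used above (e.g. level = df['level'].to_numpy() if the dataframe rows are in the same order), the per-bin minima and maxima could be filled like this (a sketch, not part of the original answer):

level = np.random.uniform(size=n)  # stand-in for the real level values
minlevel = np.full(group_lonlat.shape, np.nan)
maxlevel = np.full(group_lonlat.shape, np.nan)
for i in range(group_lonlat.shape[0]):
    for j in range(group_lonlat.shape[1]):
        idx = group_lonlat[i, j]
        if len(idx) > 0:
            minlevel[i, j] = level[idx].min()
            maxlevel[i, j] = level[idx].max()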


On my laptop it took 180 s to calculate the indices for n=17671817.

One bottleneck of this approach is the suboptimal handling of the intersection search. sortednp promises to do better than numpy here, and for large n it is more efficient to delete the already-used indices to speed up the search.

import sortednp as snp
for i in range(len(group_lon)):
    for j in range(len(group_lat)):
        group_lonlat[i, j], (ii, jj) = snp.intersect(group_lon[i], group_lat[j], 
                                                     indices=True)
        group_lon[i] = np.delete(group_lon[i], ii)
        group_lat[j] = np.delete(group_lat[j], jj)

which brings us down to around 20 s for n=17671817 and 0.3 s for n=30000.

scleronomic