0

I've 2 data frames of unequal length df and df_IGR both of which have columns Lat and Lon which I want to compare.

When I do element-wise comparison such that I'm able to extract any column values from either of the two df's for which the difference in Lat and Lon between of the two df's is less than 0.005

for row in df_IGR.itertuples():
    print(df.index[((df.Lat - row.Lat) < 0.005) & ((df.Lon - row.Lon) < 0.005)])

This Works but when I'm trying to use the same condition in an if loop, it doesn't work.

for row in df_IGR.itertuples():
    count = 0
    if (((df.Lat - row.Lat) < 0.0005) & ((df.Lon - row.Lon) < 0.0005)):
        print (row.Name)
        count = count + 1
print (count)
>>ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Can anyone explain why it's not working? And how can I make the "if loop" working here? I already went through this answer but I don't understand what I need to change. https://stackoverflow.com/a/36922103/7763326

Edit: the data looks like

df.head()
>>
                 back_hr    Lat      Lon        AGL   Pressure
BT_time                 
1998-01-01 11:15:00  0.0    75.197  -150.045    1000.0  900.8
1998-01-01 11:00:00 -0.2    75.245  -150.001    1002.4  900.5
1998-01-01 10:00:00 -1.2    75.447  -149.863    1011.2  899.2
1998-01-01 09:00:00 -2.2    75.661  -149.785    1019.3  898.3
1998-01-01 08:00:00 -3.2    75.887  -149.762    1028.1  897.1

df_IGR.head()
>>
    ID          Lat     Lon   Elevation State   Name    FirstYr LastYr  NObs
193 BEM0000644  50.796  4.358   99       NaN    UCCLE   1949    2016    42564
195 BEM0000647  50.033  5.400   558      NaN    ST-HUBE 1964    2006    26357
241 BOM0002685  53.933  27.633  231      NaN    MINSK   1939    1998    43822
242 BOM0003300  52.116  23.683  144      NaN    BREST   1951    2007    40167
244 BOM0003304  52.401  30.963  126      NaN    GOMEL   1970    2017    32410
Light_B
  • 1,660
  • 1
  • 14
  • 28
  • You need to move `count = 0` out of the loop, `&` is also bitwise in python, you need `and`. Lastly what is `df.Lat` do you not need to loop over each row of `df` as well as `df_IGR`? – Nick is tired Feb 12 '18 at 11:16
  • Could you provide some data? – Cleb Feb 12 '18 at 11:16
  • @Nick added the data for clarification. I want to take one Lat and Lon value from df_IGR and check the condition. Where it satisfies, I would like to print df_IGR.Name for that Lat and Lon – Light_B Feb 12 '18 at 11:25
  • @Cleb Edited for the data – Light_B Feb 12 '18 at 12:56

2 Answers2

1

df.Lat is a Series.

So is df.Lat - row.Lat.

So is (df.Lat - row.Lat) < 0.0005.

So is ((df.Lat - row.Lat) < 0.0005) & ((df.Lon - row.Lon) < 0.0005).

That last Series is a boolean Series that tells you which points in df are close enough to this one row in df_IGR. And this whole Series cannot just be True or False - this is what the error is saying.

I am not quite sure what exactly you are trying to achieve, so I can only explain the error.

If, however, you are trying to count the number of pairs of points that are close enough, you can do a nested loop:

count = 0
for row_IRG in df_IGR.itertuples():
    for row in df.itertuples():
        if (((row_IRG.Lat - row.Lat) < 0.0005) & ((row_IRG.Lon - row.Lon) < 0.0005)):
        print (row.Name, row_IRG.Name)
        count = count + 1
print (count)
Evgenii
  • 335
  • 2
  • 11
  • Yes, I would like to get row_IGR.Name and some other column values from df_IGR when the condition satisfies. However, there is a semantic error coming up with your approach which I can't figure out (assuming the print statement is inside the IF Loop). If I do print(row_IRG.Name, row_IRG.Lat, row_IRG.Lon ,row.Lat, row.Lon) EDIT: The first line of Output is: UCCLE 50.796 4.358 78.793 179.807 *Clearly it's giving the output where the Lat-Lon values in the 2 df's are not matching* – Light_B Feb 12 '18 at 12:50
  • I got it. The error is because I should use a modulus operator like condition. Thanks! – Light_B Feb 12 '18 at 13:00
0

Here's a readable one line answer. I've just added some data to make it more obvious what's happening. With the resulting Boolean arrays, you can apply any or all functions as you wish.

Lat1 = [3, 1, 4, 6, 1, 4, 9]  # df_IGR.Lat.values
Lat2 = [6, 7, 4, 1, 3]        # df.Lat.values

Lon1 = [5, 4, 5, 6, 8, 3, 9]  # df_IGR.Lon.values
Lon2 = [4, 2, 1, 6, 5]        # df.Lon.values

comparison = [[(i - j < 3) & (k - l < 3) for i, k in zip(Lat1, Lon1)] \
                                         for j, l in zip(Lat2, Lon2)]

# [[True, True, True, True, False, True, False],
#  [False, True, False, False, False, True, False],
#  [False, False, False, False, False, True, False],
#  [True, True, False, False, True, False, False],
#  [True, True, True, False, False, True, False]]
jpp
  • 159,742
  • 34
  • 281
  • 339
  • I initially approached it with zip but it crashed my system since it's zipping 2 huge df's. Here is that question I asked for that https://stackoverflow.com/q/47704647/7763326 – Light_B Feb 12 '18 at 11:28
  • 1
    you know you can feed pd.Series directly into zip to avoid creating large lists, e.g. `zip(df_IGR.Lat, df_IGR.Lon)`. This way you aren't creating new lists / objects. `pd.Series` are iterable, just like lists. – jpp Feb 12 '18 at 11:46