I am using Pandas and PyProj to convert eastings and northings to longitude and latitude, and then save the split output into two columns like this:

import pandas as pd
from pyproj import Proj, transform

v84 = Proj(proj="latlong",towgs84="0,0,0",ellps="WGS84")
v36 = Proj(proj="latlong", k=0.9996012717, ellps="airy",
        towgs84="446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894")
vgrid = Proj(init="world:bng")


def convertLL(row):

    easting = row['easting']
    northing = row['northing']

    vlon36, vlat36 = vgrid(easting, northing, inverse=True)

    converted = transform(v36, v84, vlon36, vlat36)

    row['longitude'] = converted[0]
    row['latitude'] = converted[1]

    return row


values = pd.read_csv("values.csv")
values = values.apply(convertLL, axis=1)

This works but is very slow and times out on larger datasets. To improve things, I am trying to convert it to use a lambda function instead, in the hope that this will speed things up. I have this so far:

def convertLL(easting, northing):

    vlon36, vlat36 = vgrid(easting, northing, inverse=True)

    converted = transform(v36, v84, vlon36, vlat36)

    # Only the longitude (the first element) is returned
    return converted[0]


values['longitude'] = values.apply(lambda row: convertLL(row['easting'], row['northing']), axis=1)

This converted version works, is faster than my old one, and does not time out on larger datasets, but it only produces the longitude. Is there a way to get it to do the latitude as well?

Also, is this vectorized? Can I speed things up any more?

EDIT

A sample of data...

name | northing | easting | latitude | longitude
------------------------------------------------
tl1  | 378778   | 366746  |          |
tl2  | 384732   | 364758  |          |
  • Can you give us the output of `df.head()` so that I have something to play with? – roganjosh May 26 '20 at 10:03
  • I have updated the post with a sample, is this enough? – fightstarr20 May 26 '20 at 10:18
  • Sorry, I got called away so not had a chance to look at it. I originally thought "well, we can probably do away with all those function calls to PyProj and implement a vectorized version" and then I found [this](https://stackoverflow.com/a/344083/4799172) which really puts me off trying that approach :P – roganjosh May 26 '20 at 17:13
  • Yeah PyProj seems suited to the task, I looked at calculating from scratch and quickly changed my mind :) – fightstarr20 May 26 '20 at 17:25
  • Eyeballing it, we might have a reasonable shot of pushing that into numpy. I'll give it a go – roganjosh May 26 '20 at 17:35
  • Aha, I know how we can do this now. `transform` takes array inputs already. Please show your imports (for `vgrid`) and where are `v36` and `v84` defined so I can make a reproducible test? – roganjosh May 26 '20 at 17:53
  • Have updated the op – fightstarr20 May 26 '20 at 18:45

1 Answer

Because of the subject matter, I think we couldn't see the wood for the trees. If we look at the docs for transform you'll see:

  • xx (scalar or array (numpy or python)) – Input x coordinate(s).
  • yy (scalar or array (numpy or python)) – Input y coordinate(s).

Great; the numpy array is exactly what we need. A pd.DataFrame can be thought of as a dictionary of arrays, so we just need to isolate those columns and pass them to the function. There's a tiny catch: each column of a DataFrame is a Series, which transform will reject, so we need to use the values attribute to get the underlying numpy array. This mini example is directly equivalent to your initial approach:

def vectorized_convert(df):
    vlon36, vlat36 = vgrid(df['easting'].values, 
                           df['northing'].values, 
                           inverse=True)
    converted = transform(v36, v84, vlon36, vlat36)
    df['longitude'] = converted[0]
    df['latitude'] = converted[1]
    return df

df = pd.DataFrame({'northing': [378778, 384732],
                   'easting': [366746, 364758]})

print(vectorized_convert(df))
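
As a side note, newer pyproj releases recommend the Transformer class over the standalone transform function. The sketch below shows the same whole-column idea with that API. It assumes that EPSG:27700 (British National Grid) and EPSG:4326 (WGS84) are acceptable stand-ins for the hand-built vgrid, v36 and v84 objects; the function name convert_with_transformer is made up for illustration, and the resulting coordinates may differ slightly from the towgs84-based version above.

```python
import pandas as pd
from pyproj import Transformer

# Assumption: EPSG:27700 (British National Grid) -> EPSG:4326 (WGS84)
# stands in for the vgrid/v36/v84 definitions used in the question.
# always_xy=True keeps the (easting, northing) -> (longitude, latitude) order.
bng_to_wgs84 = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)


def convert_with_transformer(df):
    # Hypothetical helper: converts both columns in one call, no row loop.
    lon, lat = bng_to_wgs84.transform(df["easting"].to_numpy(),
                                      df["northing"].to_numpy())
    out = df.copy()  # leave the caller's frame untouched
    out["longitude"] = lon
    out["latitude"] = lat
    return out


sample = pd.DataFrame({"northing": [378778, 384732],
                       "easting": [366746, 364758]})
result = convert_with_transformer(sample)
print(result)
```

Building the Transformer once at module level and reusing it keeps the setup cost out of the per-call path.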

And we're done. With that out of the way, we can look at timings for 100 rows (the current approach is far too slow for the 100,000 rows I usually use in timing examples):

def current_way(df):
    df = df.apply(convertLL, axis=1)
    return df


def vectorized_convert(df):
    vlon36, vlat36 = vgrid(df['easting'].values, 
                           df['northing'].values, 
                           inverse=True)

    converted = transform(v36, v84, vlon36, vlat36)
    df['longitude'] = converted[0]
    df['latitude'] = converted[1]
    return df


df = pd.DataFrame({'northing': [378778, 384732] * 50,
                   'easting': [366746, 364758] * 50})

Gives:

%timeit current_way(df)
289 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit vectorized_convert(df)
2.95 ms ± 59.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
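
On the literal question of getting both values out of a row-wise apply: one option is to return a tuple from the function and ask apply to expand it into columns. This is only a sketch, and it is still a row-by-row loop, so it will stay slow compared with the vectorized version. It assumes a pyproj Transformer built from EPSG:27700 and EPSG:4326 as a stand-in for the vgrid/v36/v84 objects, and convert_row is a made-up name.

```python
import pandas as pd
from pyproj import Transformer

# Assumption: EPSG codes stand in for the question's vgrid/v36/v84 objects.
to_wgs84 = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)


def convert_row(row):
    # Hypothetical helper: returns a (longitude, latitude) tuple for one row.
    return to_wgs84.transform(float(row["easting"]), float(row["northing"]))


values = pd.DataFrame({"northing": [378778, 384732],
                       "easting": [366746, 364758]})

# result_type="expand" turns each returned tuple into separate columns.
lon_lat = values.apply(convert_row, axis=1, result_type="expand")
lon_lat.columns = ["longitude", "latitude"]
values = values.join(lon_lat)
print(values)
```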
  • Looks really good, give me some time to digest everything and I will come back. Thanks so much! – fightstarr20 May 26 '20 at 20:44
  • I've spent some time trying to implement this but I am unsure how to pass my dataframe through to the function. In your example you are specifying the values, but how do I get it to process each row without using apply? Or is that the point, now it is vectorized we don't need to use apply? – fightstarr20 May 27 '20 at 10:07
  • @fightstarr20 the point is to avoid using `apply`. If you run the first code snippet you will see that _both_ rows are populated with `latitude` and `longitude` values in a single function call – roganjosh May 27 '20 at 10:09
  • @fightstarr20 that is the nature of vectorized operations - they act on arrays as though they were scalars, so we don't need to iterate through rows (which is slow). PyProj appears to make heavy use of Cython, so it's a codebase that gets compiled down to C++. We want to pass arrays and have it work on all the values at one time, which may be able to make use of things like BLAS/LAPACK and SIMD. `apply` will default to a python `for` loop, which has loads of overhead. Pass the whole df to the function – roganjosh May 27 '20 at 10:12
  • Understood, thanks for clearing that up, have managed to get it running now. The speed increase is incredible! No more timeouts, can't thank you enough for your help on this! – fightstarr20 May 27 '20 at 10:19
  • @fightstarr20 You're very welcome. I would like to rename your question to make it more specific to the topic, though, if that's ok? I think it's too broad to be useful to others – roganjosh May 27 '20 at 10:28