
I have created a GeoPandas GeoDataFrame with 50 million records that contain Latitude/Longitude in CRS 3857, and I want to convert it to 4326. Since the dataset is huge, GeoPandas is unable to convert it. How can I execute this in a distributed manner?

    import geopandas as gpd
    from shapely.geometry import Point

    def convert_crs(sdf):
        df = sdf.toPandas()
        gdf = gpd.GeoDataFrame(
            df.drop(['Longitude', 'Latitude'], axis=1),
            crs={'init': 'epsg:4326'},
            geometry=[Point(xy) for xy in zip(df.Longitude, df.Latitude)])
        return gdf

    result_gdf = convert_crs(grid_df)
code_bug
  • looks like https://github.com/geopandas/dask-geopandas is a potential answer. the code you pasted into the question looks wrong: a CRS projection is done with `to_crs()`, not by stating its CRS is the target when all the geometry is in another CRS ... – Rob Raymond Sep 02 '22 at 15:21
  • Yeah, also I’d check out `gpd.points_from_xy`, which will create your geometry array a whole lot faster than looping over each point – Michael Delgado Sep 02 '22 at 15:37
  • if you want help on this question though, it would be helpful if you could provide a complete example. the syntax in your example is invalid, and you don't define all terms. Check out the guide to creating a [mre]. it's hard to tell from your example exactly what you're doing, but Rob's point that you should be able to just use `grid_df.to_crs("epsg:4326")` after *creating* the GeoDataFrame with CRS 3857 is right on. – Michael Delgado Sep 02 '22 at 17:42
  • Hi @MichaelDelgado, thanks for the reply. We can use the `to_crs()` function, but my issue is that because the dataset is huge, GeoPandas is breaking. Is there any other alternative approach to tackle this issue? – code_bug Sep 02 '22 at 18:07
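Putting the comments together, a minimal sketch of the corrected single-machine flow (assuming `sdf` is the Spark DataFrame from the question, with `Longitude`/`Latitude` columns that actually hold EPSG:3857 coordinates):

    import geopandas as gpd

    def convert_crs(sdf):
        df = sdf.toPandas()
        # Build the geometry in the CRS the coordinates are actually in (3857) ...
        gdf = gpd.GeoDataFrame(
            df.drop(['Longitude', 'Latitude'], axis=1),
            geometry=gpd.points_from_xy(df.Longitude, df.Latitude),
            crs="EPSG:3857",
        )
        # ... and then reproject to 4326 with to_crs().
        return gdf.to_crs("EPSG:4326")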

3 Answers


See: https://github.com/geopandas/geopandas/issues/1400

This is very fast and memory efficient:

from pyproj import Transformer

# Build one transformer and apply it to whole NumPy arrays at once.
# This example goes WGS84 -> web mercator; swap the two CRS arguments
# to go the other way (3857 -> 4326, as the question asks).
trans = Transformer.from_crs(
    "EPSG:4326",
    "EPSG:3857",
    always_xy=True,
)
xx, yy = trans.transform(df["Longitude"].values, df["Latitude"].values)
df["X"] = xx
df["Y"] = yy
snowman2

See the GeoPandas installation docs and make sure you have the latest versions of GeoPandas and PyGEOS installed. From the installation docs:

Work is ongoing to improve the performance of GeoPandas. Currently, the fast implementations of basic spatial operations live in the PyGEOS package (but work is under way to contribute those improvements to Shapely). Starting with GeoPandas 0.8, it is possible to optionally use those experimental speedups by installing PyGEOS.
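A quick way to check whether the PyGEOS-backed speedups are actually in use (a sketch; the `use_pygeos` option applies to the GeoPandas 0.8–0.x series, before Shapely 2.0 absorbed these speedups):

    import geopandas as gpd

    # True means geometry arrays are backed by PyGEOS and vectorized ops are fast.
    print(gpd.options.use_pygeos)

    # It can also be toggled explicitly, before any GeoDataFrames are created:
    # gpd.options.use_pygeos = True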

Note the caveat that to_crs will ignore & drop any z coordinate information, so if this is important you unfortunately cannot use these speedups and something like dask_geopandas may be required.
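For the distributed route, a minimal dask_geopandas sketch (assuming a pandas DataFrame `df` with `Longitude`/`Latitude` columns holding EPSG:3857 coordinates; `npartitions=32` is an arbitrary choice):

    import dask_geopandas
    import geopandas as gpd

    gdf = gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df.Longitude, df.Latitude),
        crs="EPSG:3857",
    )
    # Partition the GeoDataFrame so each partition is reprojected in parallel.
    dgdf = dask_geopandas.from_geopandas(gdf, npartitions=32)
    result = dgdf.to_crs("EPSG:4326").compute()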

However, with a recent version of GeoPandas and PyGEOS installed, converting the CRS of 50 million points should be possible. The following generates 50 million random points (<1s), creates a GeoDataFrame with geometries from the points in WGS84 (18s), converts them all to web mercator (1m21s) and then converts them back to WGS84 (54s):

In [1]: import geopandas as gpd, pandas as pd, numpy as np

In [2]: %%time
   ...: n = int(50e6)
   ...: lats = np.random.random(size=n) * 180 - 90
   ...: lons = np.random.random(size=n) * 360 - 180
   ...:
   ...:
CPU times: user 613 ms, sys: 161 ms, total: 774 ms
Wall time: 785 ms

In [3]: %%time
   ...: df = gpd.GeoDataFrame(geometry=gpd.points_from_xy(lons, lats, crs="epsg:4326"))
   ...:
   ...:
CPU times: user 11.7 s, sys: 4.66 s, total: 16.4 s
Wall time: 17.8 s

In [4]: %%time
   ...: df_mercator = df.to_crs("epsg:3857")
   ...:
   ...:
CPU times: user 1min 1s, sys: 13.7 s, total: 1min 15s
Wall time: 1min 21s

In [5]: %%time
   ...: df_wgs84 = df_mercator.to_crs("epsg:4326")
   ...:
   ...:
CPU times: user 39.4 s, sys: 9.59 s, total: 49 s
Wall time: 54 s

I ran this on a 2021 Apple M1 Max chip with 32 GB of memory, using GeoPandas v0.10.2 and PyGEOS v0.12.0. Real memory usage peaked at around 9 GB, so it's possible your computer is hitting memory constraints, or the runtime may be the issue. If so, additional debugging details and the full workflow would definitely be helpful! But this seems like a workflow that should be doable on most computers. You may need to partition the data and work through it in chunks if you're facing memory constraints, but it's within a single order of magnitude of what most computers should be able to handle.
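If memory is the constraint, a rough sketch of the chunked approach just mentioned (the chunk size is arbitrary and should be tuned to your memory budget):

    import pandas as pd

    chunk_size = 5_000_000  # arbitrary; tune to available memory
    parts = [
        df_mercator.iloc[i:i + chunk_size].to_crs("epsg:4326")
        for i in range(0, len(df_mercator), chunk_size)
    ]
    df_wgs84 = pd.concat(parts)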

Michael Delgado

I hope this answer is fair enough, because it will effectively solve your problem for any size of dataset, and it's a well-trodden approach to dealing with data that's too big for memory.

Answer: Store your data in PostGIS

You would then have two options for doing what you want.

  1. Do data manipulations in PostGIS, using its geo-spatial SQL syntax. The database will do the memory management for you.
  2. Retrieve data a chunk at a time, do the manipulation in GeoPandas and rewrite to the database.

In my experience it's solid, reliable and pretty well integrated with GeoPandas via GeoAlchemy2.
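A rough sketch of option 2 (chunked round trips through PostGIS with GeoPandas; the connection string, table names, and geometry column are placeholders). Option 1 would instead be a single SQL statement using PostGIS's ST_Transform:

    import geopandas as gpd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@host/dbname")  # placeholder

    # Read the source table a chunk at a time, reproject, and append to a new table.
    for chunk in gpd.read_postgis(
        "SELECT * FROM points_3857", engine, geom_col="geometry", chunksize=1_000_000
    ):
        chunk.to_crs("EPSG:4326").to_postgis("points_4326", engine, if_exists="append")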

David Harris