0

I have a series of geographic coordinates as below and am fairly new to Series. I need to extract the latitude and longitude from these coordinates into separate series to concatenate back into the dataframe. I have tried a few different methods w/o luck. Can someone please help me understand how best to do this?

Code I have tried:

printing type of this is a <list> but just returns the first record in the series

df['Location'].str.split(',')[0]

treat as string and try to extract the first list element and then the first of the tuple

longitude = df['Location'].str[0][0].replace('(', '')

The Location field is a series of lists of tuples.

0         [(34.0255,  -118.3002)]
1         [(34.0256,  -118.3089)]
2         [(34.0738,  -118.2078)]
3         [(34.0492,  -118.2391)]
4         [(34.0108,  -118.3182)]
                   ...
594800    [(34.0436,  -118.3053)]
594801    [(33.9572,  -118.3962)]
594802    [(34.1684,  -118.3982)]
594803    [(34.1976,  -118.3965)]
594804    [(34.0504,  -118.3442)]
wjandrea
  • 28,235
  • 9
  • 60
  • 81
Cord
  • 317
  • 2
  • 10

2 Answers2

0

Based on the data from the comment, Location is actually a string, containing a tuple. And there's a None among the values.

So we need to parse the coordinates and be careful of Nones.

I have tried two approaches

  1. Using ast.literal_eval to parse a tuple from a string. Convert a String representation of a Dictionary to a dictionary
import ast

def get_coordinate_ast(location, coord_idx):
    try:
        return ast.literal_eval(location)[coord_idx]
    except Exception as e:
        print(e)
        return
%%time
# getting lat
df["lat_ast"] = df["Location"].apply(lambda l: get_coordinate_ast(l, 0))
# getting long
df["lon_ast"] = df["Location"].apply(lambda l: get_coordinate_ast(l, 1))

print(f"Not able to parse lat: {df['lat_ast'].isna().sum()}")
print(f"Not able to parse lon: {df['lon_ast'].isna().sum()}")

Output:

malformed node or string: nan
malformed node or string: nan
Not able to parse lat: 1
Not able to parse lon: 1
CPU times: user 6.41 s, sys: 57.5 ms, total: 6.47 s
Wall time: 6.48 s
  1. Using regexp:
%%time
# getting lat
df["lat_re"] = df["Location"].apply(lambda l: get_coordinate_re(l, 0))
# getting long
df["lon_re"] = df["Location"].apply(lambda l: get_coordinate_re(l, 1))

print(f"Not able to parse lat: {df['lat_re'].isna().sum()}")
print(f"Not able to parse lon: {df['lon_re'].isna().sum()}")

Output:

expected string or bytes-like object
expected string or bytes-like object
Not able to parse lat: 1
Not able to parse lon: 1
CPU times: user 941 ms, sys: 16.8 ms, total: 958 ms
Wall time: 960 ms

Both methods produce only 1 unparsed value (last row). And give equal results

print((df["lat_ast"] != df["lat_re"]).sum())
print((df["lon_ast"] != df["lon_re"]).sum())

Output:

1
1

(These are both Nones, that didn't match)

However, re method is much more quicker: 960 ms vs 6.48 s.

Maria K
  • 1,491
  • 1
  • 3
  • 14
  • I get an index error which is likely because the data is not exactly as i described it. The data is from here: https://data.lacity.org/Public-Safety/Traffic-Collision-Data-from-2010-to-Present/d5tf-ez2w - see export in top left. From there, I loaded a data frame with df = pd.read_csv('../data/Traffic_Collision_Data_from_2010_to_Present.csv') and then try to manipulate the Location field. – Cord Jul 30 '23 at 19:29
  • I have just rewritten the answer to deal with the new info. – Maria K Jul 30 '23 at 21:29
  • For completeness and my understanding, what is the function get_coordinate_re(location, index). Thank you – Cord Jul 31 '23 at 21:11
  • I created a function to naively parse the location value - that now represents as a string def get_lat_lon(location, index): try: lat_lon = location.split(',') if index == 0: return lat_lon[0].replace('(', '') elif index == 1: return lat_lon[1].replace(')', '') except Exception as e: print(e) return 0 – Cord Aug 01 '23 at 03:16
0

Construct a new dataframe from records:

pd.DataFrame.from_records(df['Location'].apply(pd.Series)[0].values, 
                          columns=['lat', 'long'])

       lat      long
0  34.0255 -118.3002
1  34.0256 -118.3089
2  34.0738 -118.2078
3  34.0492 -118.2391
4  34.0108 -118.3182
RomanPerekhrest
  • 88,541
  • 4
  • 65
  • 105
  • I ran this and got this error: AssertionError: 2 columns passed, passed data had 20 columns - please see comment to Maria K's solution, my full dataset has 20 columns. Thank you! – Cord Jul 30 '23 at 19:33
  • @Cord, so how are you asserting that? – RomanPerekhrest Jul 31 '23 at 07:46