2

I am currently working with an .xml file that I have converted into a data frame that looks like so:

enter image description here

I want to split the Coordinates column into 4 separate columns with the following layout:

to_longitude, to_latitude, from_longitude, from_latitude

I am attempting to do this with the code below:

pd.concat([df[[0]], df[1].str.split(',', expand=True)], axis=1)

However, this gives me the following error:

KeyError: "None of [Int64Index([0], dtype='int64')] are in the [columns]"

My question is what am I doing wrong and how can I correct my code to make it work as intended?

SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • Please include a [`reproducible example`](https://stackoverflow.com/a/20159305/4985099) in the post, instead of a image along with the expected output. – sushanth Jul 27 '21 at 12:12
  • can you post a sample of your dataset as text so that we can replicate please? You can use ```df.to_dict()``` – sophocles Jul 27 '21 at 12:13
  • Execution time comparison of the solutions has been benchmarked below for your reference. – SeaBean Jul 27 '21 at 15:03

2 Answers2

4

Consider using Pandas apply function -

def my_func(record):
    record['to_longitude']=record['Coordinates'][0][0]
    record['to_latitude']=record['Coordinates'][0][1]
    record['from_longitude']=record['Coordinates'][1][0]
    record['from_latitude']=record['Coordinates'][1][1]
    return record


new_df = df.apply(my_func, axis=1)
Tom Ron
  • 5,906
  • 3
  • 22
  • 38
1

You can use the string accessor str[] to get the values of nested tuples to set up the 4 columns, as follows:

df['to_longitude'] = df['Coordinates'].str[0].str[0]
df['to_latitude'] = df['Coordinates'].str[0].str[1]
df['from_longitude'] = df['Coordinates'].str[1].str[0]
df['from_latitude'] = df['Coordinates'].str[1].str[1]

Demo

data = {'Link': ['abd', 'abe'], 
        'Coordinates': [((-4.21, 55.85), (-4.22, 55.86)), ((-4.25, 55.82), (-4.26, 55.83))]}
df = pd.DataFrame(data)

  Link                       Coordinates
0  abd  ((-4.21, 55.85), (-4.22, 55.86))
1  abe  ((-4.25, 55.82), (-4.26, 55.83))

df['to_longitude'] = df['Coordinates'].str[0].str[0]
df['to_latitude'] = df['Coordinates'].str[0].str[1]
df['from_longitude'] = df['Coordinates'].str[1].str[0]
df['from_latitude'] = df['Coordinates'].str[1].str[1]

  Link                       Coordinates  to_longitude  to_latitude  from_longitude  from_latitude
0  abd  ((-4.21, 55.85), (-4.22, 55.86))         -4.21        55.85           -4.22          55.86
1  abe  ((-4.25, 55.82), (-4.26, 55.83))         -4.25        55.82           -4.26          55.83

Execution time comparison:

Test data of 40,000 rows

df2 = pd.concat([df] * 20000, ignore_index=True)

Solution 1: Tom Ron's solution

def my_func(record):
    record['to_longitude']=record['Coordinates'][0][0]
    record['to_latitude']=record['Coordinates'][0][1]
    record['from_longitude']=record['Coordinates'][1][0]
    record['from_latitude']=record['Coordinates'][1][1]
    return record

%timeit new_df = df2.apply(my_func, axis=1)

Result:

1min 16s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2: SeaBean's solution

%%timeit
df2['to_longitude'] = df2['Coordinates'].str[0].str[0]
df2['to_latitude'] = df2['Coordinates'].str[0].str[1]
df2['from_longitude'] = df2['Coordinates'].str[1].str[0]
df2['from_latitude'] = df2['Coordinates'].str[1].str[1]

Result:

165 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Solution 3: Anurag Dabas' solution

%%timeit
cols=['to_longitude','to_latitude','from_longitude','from_latitude']
out=pd.DataFrame(np.hstack(df2['Coordinates'].values),columns=cols)
#OR
#out=pd.DataFrame(np.concatenate(df['Coordinates'].values,axis=1),columns=cols)

Result: Can't get the benchmarking since got error for both options:

ValueError: Shape of passed values is (2, 80000), indices imply (2, 4)

Summary

Solution 1: Tom Ron's solution

1min 16s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2: SeaBean's solution

165 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Solution 3: Anurag Dabas' solution

Can't get the benchmarking since got error for large dataset both options

For the first 2 solutions with benchmarking results, SeaBean's solution is 460x times faster than Tom Ron's solution (165ms vs 1min 16s) for 40,000 rows of data.

The faster execution time is contributed by all vectorized Pandas operations (optimized with fast C or CPython codes) used in codes, instead of the slow apply() on axis=1 which under the hood is slow Python looping.

SeaBean
  • 22,547
  • 3
  • 13
  • 25