24

How do I convert this dataframe

                                          location  value                       
0                   (Richmond, Virginia, nan, USA)    100                       
1              (New York City, New York, nan, USA)    200                       

to this:

    city            state       region    country   value
0   Richmond        Virginia    nan       USA       100
1   New York City   New York    nan       USA       200

Note that the location column in the first dataframe contains tuples. I want to create four columns out of the location column.

ba_ul
  • 2,049
  • 5
  • 22
  • 35

4 Answers4

15
new_col_list = ['city','state','regions','country']
for n,col in enumerate(new_col_list):
    df[col] = df['location'].apply(lambda location: location[n])

df = df.drop('location',axis=1)
exp1orer
  • 11,481
  • 7
  • 38
  • 51
7

If you return a Series of the (split) location, you can merge (join to merge on index) the resulting DF directly with your value column.

addr = ['city', 'state', 'region', 'country']
df[['value']].join(df.location.apply(lambda loc: Series(loc, index=addr)))

   value           city     state  region country
0    100       Richmond  Virginia     NaN     USA
1    200  New York City  New York     NaN     USA
meloncholy
  • 2,122
  • 18
  • 16
  • Thanks. @exp1orer's solution worked, so I didn't have to try yours. – ba_ul Aug 29 '14 at 00:23
  • That is nicely done! If I'm not mistaken, it only works if `df` has index that is `range(len(df))`, right? – exp1orer Aug 29 '14 at 05:10
  • @exp1orer Not quite. `join` is shorthand for merging on index with both frames, so the indices need only be consistent (which it will be here as the apply and col selection don't affect it). I'll edit the answer. – meloncholy Aug 29 '14 at 06:37
  • 6
    Should have thought about this earlier, but worth noting that this is **much** slower than @exp1orer's solution on a largish frame due to all of the Series being created. – meloncholy Aug 29 '14 at 07:11
3

I haven't timed this, but I would suggest this option:

df.loc[:,'city']=df.location.map(lambda x:x[0])
df.loc[:,'state']=df.location.map(lambda x:x[1])
df.loc[:,'regions']=df.location.map(lambda x:x[2])
df.loc[:,'country']=df.location.map(lambda x:x[3])

I'm guessing avoiding explicit for loop might lend itself to a SIMD instruction (certainly numpy looks for that, but perhaps not other libraries)

Martin Alley
  • 121
  • 1
  • 9
0

I prefer to use pd.DataFrame.from_records to convert the tuples to Series. Then this can be joined to the previous dataset as described by meloncholy.

df = pd.DataFrame({"location":[("Richmond", "Virginia", pd.NA, "USA"),
                               ("New York City", "New York", pd.NA, "USA")],
                   "value": [100,200]})

loc = pd.DataFrame.from_records(df.location, columns=['city','state','regions','country'])
df.drop("location", axis=1).join(loc)

from_records does assume a sequential index. If this is not the case you should pass the index to the new DataFrame:

loc = pd.DataFrame.from_records(df.location.reset_index(drop=True),
                                columns=['city','state','regions','country'],
                                index=df.index)
Quantum7
  • 3,165
  • 3
  • 34
  • 45