3

I have a pandas DataFrame containing a string column which needs splitting into two separate columns. The answer using tolist that I found on SO from this question works like a charm, except when my column contains NaNs. The excerpt below describes the difficulty:

import pandas as pd
import numpy as np

# Example DataFrame
df = pd.DataFrame([[25.0, '34.2/ 18.1', 'one'],
                   [32.6, '28.6/ 17.9', 'two'],
                   [12.5, '30.1/ 17.6', 'three']], columns=['A', 'B', 'C'])
df2 = df.copy()

# This method works when all data are present
df['D'] = pd.DataFrame(df['B'].str.split('/').tolist())[1]

# However, when there are NaNs:
df2['B'][0] = np.nan

# This line fails
df2['D'] = pd.DataFrame(df2['B'].str.split('/').tolist())[1]

It gives me a KeyError, because the intermediate DataFrame only has one column, indicating that the bother of going to a list and back doesn't accomplish anything anymore:

               0
0            NaN
1  [28.6,  17.9]
2  [30.1,  17.6]

I've tried dropping the NaN first via pd.DataFrame(df2['B'].str.split('/').dropna().tolist()), but then I lose my index ... I need to keep the NaN at index 0. I've also thought of somehow duplicating the NaN in the creation of the intermediate DataFrame to force the two columns, but am having no luck.

This is what I would need my data to look like for df2:

      A           B      C     D
0  25.0         NaN    one   NaN
1  32.6  28.6/ 17.9    two  17.9
2  12.5  30.1/ 17.6  three  17.6

Is there a way to do this without using a list as an intermediary? Or somehow deal with the NaN?

Community
  • 1
  • 1
Ajean
  • 5,528
  • 14
  • 46
  • 69

2 Answers2

4

You can continue to use your method if you use the str accessor again after the split (instead of using tolist() and making another DataFrame):

>>> df2['D'] = df2['B'].str.split('/').str[-1]
>>> df2
      A           B      C      D
0  25.0         NaN    one    NaN
1  32.6  28.6/ 17.9    two   17.9
2  12.5  30.1/ 17.6  three   17.6

This returns NaN if the index doesn't exist, instead of raising the error.

Alex Riley
  • 169,130
  • 45
  • 262
  • 238
  • Perfect! I didn't realize you could use the `str` like that - both answers are doubly good because they don't require the `tolist()` funny business. – Ajean Feb 06 '15 at 22:48
  • This is roughly twice as fast as the `str.extract` method I suggested (particularly when applied to larger DataFrames). – unutbu Feb 07 '15 at 00:51
1

The str.extract method allows you to provide a regex pattern. Each group in the pattern is returned as a separate column. NaN is used when no match is found:

df2['D'] = df2['B'].str.extract(r'/(.*)')
print(df2)

yields

      A           B      C      D
0  25.0         NaN    one    NaN
1  32.6  28.6/ 17.9    two   17.9
2  12.5  30.1/ 17.6  three   17.6

Note that if you want the D column to be treated as floats, then you'll also need to call astype:

df2['D'] = df2['D'].astype('float')
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Excellent. I'm only just coming up to speed on the `str` methods, but this one looks like it will be very useful for parsing some more complicated files that I currently see in my future (handy for when you can't control the file format...) – Ajean Feb 06 '15 at 22:50