>import pandas as pd

>e = {0: pd.Series(['NHL_toronto_maple-leafs_Canada', 'NHL_boston_bruins_US', 'NHL_detroit_red-wings', 'NHL_montreal'])}

>df = pd.DataFrame(e)

>df

    0
0   NHL_toronto_maple-leafs_Canada
1   NHL_boston_bruins_US
2   NHL_detroit_red-wings
3   NHL_montreal

I want to:

1) split each string in the above DataFrame column (a Series) on '_'

2) drop the 'NHL' string

3) recombine the remaining text with '_'

4) attach the result of #3 to the original DataFrame as a second column

To do this I tried the following:

>df2 = df.icol(0).str.split('_').apply(pd.Series).iloc[:,1:]

>df2

          1            2       3
0   toronto  maple-leafs  Canada
1    boston       bruins      US
2   detroit    red-wings     NaN
3  montreal          NaN     NaN

I tried to follow the suggestion in "combine columns in Pandas" by doing something like:

>df2['4'] = df2.iloc[:,0] + "_" + df2.iloc[:,1] + "_" + df2.iloc[:,2]

>df2

          1            2       3                           4
0   toronto  maple-leafs  Canada  toronto_maple-leafs_Canada
1    boston       bruins      US            boston_bruins_US
2   detroit    red-wings     NaN                         NaN
3  montreal          NaN     NaN                         NaN

However, you can see that whenever the combination involves a NaN cell, the end result is NaN as well. This is not what I want.

Column 4 should look like:

toronto_maple-leafs_Canada
boston_bruins_US
detroit_red-wings
montreal

Also, is there an efficient way to do this type of operation, as my real data set is quite large?
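
For reference, one NaN-tolerant way to build that column is a row-wise join that drops the missing cells first. A minimal sketch, assuming df2 is the split frame above (the new column label 1 is arbitrary; row-wise apply is slow on big frames, which is part of why the efficiency question matters):

df[1] = df2.apply(lambda row: '_'.join(row.dropna()), axis=1)  # NaN cells are dropped before joining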

codingknob

2 Answers


If you're just looking to remove the leading 'NHL_' substring, you could simply slice it off:

In [84]: df[0].str[4:]
Out[84]:
0    toronto_maple-leafs_Canada
1              boston_bruins_US
2             detroit_red-wings
3                      montreal
Name: 0, dtype: object
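
If some rows might not begin with 'NHL_', slicing off four characters would mangle them; an anchored regex replace strips the prefix only where it is present. A small sketch (the regex=True keyword is needed on recent pandas; very old versions treat the pattern as a regex by default and lack the keyword):

df[0].str.replace(r'^NHL_', '', regex=True)  # removes the prefix only when the string starts with it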

However, if you need to split and join, you could chain the string methods like this:

In [85]: df[0].str.split('_').str[1:].str.join('_')
Out[85]:
0    toronto_maple-leafs_Canada
1              boston_bruins_US
2             detroit_red-wings
3                      montreal
Name: 0, dtype: object
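
To finish step 4 from the question, the recombined text can be assigned straight back as a new column; a minimal example (the column label 1 is arbitrary):

df[1] = df[0].str.split('_').str[1:].str.join('_')  # attach as a second column of the original frame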

Alternatively, you could use apply:

In [86]: df[0].apply(lambda x: '_'.join(x.split('_')[1:])) # Also, x.split('_', 1)[1]
Out[86]:
0    toronto_maple-leafs_Canada
1              boston_bruins_US
2             detroit_red-wings
3                      montreal
Name: 0, dtype: object

And, as @DSM pointed out, "split accepts an argument for the maximum number of splits":

In [87]: df[0].str.split("_", 1).str[1]
Out[87]:
0    toronto_maple-leafs_Canada
1              boston_bruins_US
2             detroit_red-wings
3                      montreal
Name: 0, dtype: object
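
Note that on newer pandas the maximum-split count must be passed as the keyword-only argument n, so a version-safe spelling of the same call would be:

df[0].str.split('_', n=1).str[1]  # n=1 splits only at the first underscore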

Depending on the size of your data, you could benchmark these methods and use the appropriate one.
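
A minimal benchmarking sketch using the standard timeit module on a synthetic Series (the size and repeat counts here are arbitrary; as the comments note, the relative rankings shift with data size):

import timeit
import pandas as pd

# synthetic data; real rankings depend on string lengths and row counts
s = pd.Series(['NHL_toronto_maple-leafs_Canada'] * 10000)

candidates = {
    'slice':      lambda: s.str[4:],
    'split/join': lambda: s.str.split('_').str[1:].str.join('_'),
    'maxsplit':   lambda: s.str.split('_', n=1).str[1],
    'apply':      lambda: s.apply(lambda x: x.split('_', 1)[1]),
}

for name, fn in candidates.items():
    print(name, timeit.timeit(fn, number=100))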

Zero
  • `split` accepts an argument for the maximum number of splits, so `df[0].str.split("_", 1).str[1]` works too. Although I agree that if we know that everything starts with `NHL_`, simply removing it is far and away the simplest. – DSM May 01 '15 at 19:45
  • Right. Didn't think about the arguments! Good catch. Thanks. – Zero May 01 '15 at 19:49

You could use apply like this:

In [1782]: df[0].apply(lambda v: '_'.join(v.split('_')[1:]))
Out[1782]: 
0    toronto_maple-leafs_Canada
1              boston_bruins_US
2             detroit_red-wings
3                      montreal
Name: 0, dtype: object

In [1783]: df[0] = df[0].apply(lambda v: '_'.join(v.split('_')[1:]))

Surprisingly, the str-based methods seem to take longer than apply:

In [1811]: %timeit df[0].apply(lambda v: '_'.join(v.split('_')[1:]))
10000 loops, best of 3: 127 µs per loop

In [1810]: %timeit df[0].str[4:]
1000 loops, best of 3: 179 µs per loop

In [1812]: %timeit df[0].str.split('_').str[1:].str.join('_')
1000 loops, best of 3: 553 µs per loop

In [1813]: %timeit df[0].str.split("_", 1).str[1]
1000 loops, best of 3: 374 µs per loop
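
As Zero points out in the comments below, the per-element lambda can be trimmed further with a single maxsplit, avoiding the full split and re-join; a small sketch (it assumes every value contains at least one '_'):

df[0].apply(lambda v: v.split('_', 1)[1])  # keeps everything after the first '_'; IndexError if no '_' present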
fixxxer
  • `df[0].str[4:]` is fastest for rows > 1k. It depends on the size of the data you tested. However, you could speed up your solution by using `lambda v: v.split('_', 1)[1]` – Zero May 01 '15 at 20:02
  • @fixxer & John Galt - thank you kindly to both of you for the detailed solutions. I will use cycle through them all and see which one is best for performance. The real data arrives on a daily basis and is only several hundred rows. The entire data set is in the millions but daily only a few hundred. – codingknob May 02 '15 at 19:31
  • @JohnGalt - I also noticed that the lambda function is the fastest. Faster than df[0].str.split("_", 1).str[1]. The size of the data set is a few hundred rows. – codingknob May 02 '15 at 19:46