
I'm looking to split a string Series at different points depending on the length of certain substrings:

In [47]: df = pd.DataFrame(['group9class1', 'group10class2', 'group11class20'], columns=['group_class'])
In [48]: split_locations = df.group_class.str.rfind('class')
In [49]: split_locations
Out[49]: 
0    6
1    7
2    7
dtype: int64
In [50]: df
Out[50]: 
      group_class
0    group9class1
1   group10class2
2  group11class20

My output should look like:

      group_class    group    class
0    group9class1   group9   class1
1   group10class2  group10   class2
2  group11class20  group11  class20

I half-thought this might work:

In [56]: df.group_class.str[:split_locations]
Out[56]: 
0   NaN
1   NaN
2   NaN

How can I slice my strings by the variable locations in split_locations?

LondonRob
  • This works: `df[['group_class']].apply(lambda x: x.str[:split_locations[x.name]], axis=1)`. The `axis=1` and the double `[]` are required to force a `df.apply` and to get access to the row index, which is then used to index into `split_locations`. – EdChum Aug 07 '15 at 15:22
  • If you're open to an alternative to slicing, I'd go for `df.group_class.str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')` – Alex Riley Aug 07 '15 at 15:25
  • Learned loads of great stuff from the answers to this question. @Rob: sorry I had to accept just one. See [this question](http://stackoverflow.com/questions/31801079/attributes-information-contained-in-dataframe-column-names/31861473#31861473) for some context on why I asked the question in the first place. – LondonRob Aug 07 '15 at 16:08

3 Answers


This works. Selecting with double brackets ([[]]) gives a DataFrame, so apply with axis=1 passes each row to the lambda, and x.name is then the index value of the current row, which you can use to index into the split_locations Series:

In [119]:    
df[['group_class']].apply(lambda x: pd.Series([x.str[split_locations[x.name]:][0], x.str[:split_locations[x.name]][0]]), axis=1)
Out[119]:
         0        1
0   class1   group9
1   class2  group10
2  class20  group11

Or, as @ajcr suggested, you can use str.extract:

In [106]:

df['group_class'].str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')
Out[106]:
     group    class
0   group9   class1
1  group10   class2
2  group11  class20

EDIT

Regex explanation:

The regex came from @ajcr (thanks!). str.extract pulls out the capture groups in the pattern, and each group becomes a new column.

?P<group> gives the capture group a name, which is used as the column name. If the name is omitted, the column is labelled with an int instead.

The rest should be self-explanatory: group[0-9]+ matches the literal string group followed by one or more digits. The [] denotes a character set and 0-9 is the range of digits. For this data it is equivalent to group\d+, where \d means digit.

So it could be rewritten as:

df['group_class'].str.extract(r'(?P<group>group\d+)(?P<class>class\d+)')
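
To see how the group names affect the resulting columns, here is a minimal sketch, assuming the example frame from the question. The variable names named, unnamed and result are only illustrative:

```python
import pandas as pd

df = pd.DataFrame(['group9class1', 'group10class2', 'group11class20'],
                  columns=['group_class'])

# Named groups: the column labels are taken from the group names.
named = df['group_class'].str.extract(r'(?P<group>group\d+)(?P<class>class\d+)')

# Unnamed groups: the columns fall back to integer labels.
unnamed = df['group_class'].str.extract(r'(group\d+)(class\d+)')

# Attach the extracted columns to the original frame (aligned on the index).
result = df.join(named)

print(list(named.columns))
print(list(unnamed.columns))
print(result)
```

The join step is just one way to get the extracted columns next to group_class. Assigning them one by one should work as well.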
EdChum
  • Actually I think the `str.extract` method is better as this produces both columns, see my update – EdChum Aug 07 '15 at 15:36
  • I know this is not necessarily the best place for it, but any commentary on how you came up with those `regex`s would be welcome. – LondonRob Aug 07 '15 at 15:45
  • Sure, hope it makes more sense, I'm not a regex ninja but this regex is pretty simple here – EdChum Aug 07 '15 at 15:51
  • [Here](http://stackoverflow.com/questions/10059673/named-regular-expression-group-pgroup-nameregexp-what-does-p-stand-for) is a good discussion about `?P` in regex. (This is actually mentioned in the docstring for Series.str.extract!) – LondonRob Aug 07 '15 at 15:55

Use a regular expression to split the string:

 import re

 regex = re.compile("(class)")
 s = "group1class23"
 # Insert a space in front of "class", then split on that space
 # to separate the group part from the class part.
 split_string = re.sub(regex, " \\1", s).split(" ")

This will return the list:

 ['group1', 'class23']

So to append two new columns to your DataFrame you can do:

new_cols = [re.sub(regex, " \\1", x).split(" ") for x in df.group_class]
df['group'], df['class'] = zip(*new_cols)

Which results in:

      group_class    group    class
0    group9class1   group9   class1
1   group10class2  group10   class2
2  group11class20  group11  class20
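
A possible variation, not part of the original answer: rather than inserting a space and splitting on it, re.split can split directly at the position just before "class" by using a lookahead. This is a sketch that assumes Python 3.7 or later (where re.split accepts zero-width matches) and that "class" occurs exactly once in each string:

```python
import re

import pandas as pd

df = pd.DataFrame(['group9class1', 'group10class2', 'group11class20'],
                  columns=['group_class'])

# (?=class) matches the empty position right before "class",
# so nothing is consumed and no separator has to be inserted.
# Splitting on a zero-width match requires Python 3.7+.
new_cols = [re.split(r'(?=class)', x) for x in df.group_class]
df['group'], df['class'] = zip(*new_cols)

print(df)
```

If "class" could appear more than once, each row would produce more than two pieces and the unpacking into two columns would fail, so this only fits data shaped like the example.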
LondonRob
Rob
  • Hope you don't mind. I've added the code which will actually produce the desired output. Yes, I had fun! – LondonRob Aug 07 '15 at 15:41

You can also use zip together with a list comprehension.

df['group'], df['class'] = zip(
    *[(string[:n], string[n:]) 
      for string, n in zip(df.group_class, split_locations)])

>>> df
      group_class    group    class
0    group9class1   group9   class1
1   group10class2  group10   class2
2  group11class20  group11  class20
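
One thing to watch with this approach: str.rfind returns -1 when the substring is not found, and slicing with -1 would silently cut off the last character. Below is a minimal sketch of a guard for that case. The helper split_at and the extra row 'group12' are hypothetical, added only to illustrate the point:

```python
import pandas as pd

# 'group12' is a made-up row with no "class" part.
df = pd.DataFrame({'group_class': ['group9class1', 'group10class2', 'group12']})
split_locations = df.group_class.str.rfind('class')


def split_at(string, n):
    """Split string at position n; n < 0 means the substring was not found."""
    if n < 0:
        # Keep the whole string as the group and leave the class empty.
        return string, ''
    return string[:n], string[n:]


df['group'], df['class'] = zip(
    *[split_at(string, n)
      for string, n in zip(df.group_class, split_locations)])

print(df)
```

Returning an empty string for the missing part is an arbitrary choice here. None or NaN may suit your data better.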
Alexander
  • This wins the instantly-clear-what's-going-on contest! – LondonRob Aug 07 '15 at 15:44
  • Zen: 3) Simple is better than complex. 7) Readability counts. – Alexander Aug 07 '15 at 15:52
  • I love this, but the existence of `Series.str.extract` plus the `?P` thing in [this answer](http://stackoverflow.com/a/31881349/2071807) blew my mind, so I've accepted that one instead. Thanks though! – LondonRob Aug 07 '15 at 16:02