split string into number and text with pandas

Question

The Setup

I have a pandas dataframe that contains a column 'iso' containing chemical isotope symbols, such as '4He', '16O', '197Au'. I want to label many (but not all) isotopes on a plot using the annotate() function in matplotlib. The label format should have the atomic mass in superscript. I can do this with the LaTeX style formatting:

axis.annotate('$^{4}$He', xy=(x, y), xycoords='data')

I could write dozens of annotate() statements like the one above for each isotope I want to label, but I'd rather automate.

The Question

How can I extract the isotope number and name from my iso column?

With those pieces extracted I can make the labels. Let's say we dump them into the variables Num and Sym. Now I can loop over my isotopes and do something like this:

for i in list_of_isotopes:
  (Num, Sym) = df[df.iso==i].iso.str.MISSING_STRING_METHOD(???)
  axis.annotate('$^{%s}$%s' %(Num, Sym), xy=(x[Num], y[Num]), xycoords='data')

Presumably, there is a pandas string method that I can drop into the above, but I'm having trouble coming up with a solution. I've been trying split() and extract() with a few different patterns, but can't get the desired effect.

Maybe this can help to split your `iso` column. It will create a column for each token returned by the `split`. Could you provide an example of the data to split and the pattern to match ? `df = pd.DataFrame('part1_part2', index=range(0,3), columns=['iso']) df['iso'].str.split('_', expand=True)`. — Romain, Aug 26 '15 at 14:39
That would require my column to already have an underscore... which it does not. — Paul T., Aug 26 '15 at 14:43
Check my answer using an improvable `regexp` to split the string. — Romain, Aug 26 '15 at 15:05

score 12 · Accepted Answer · edited Feb 20 '20 at 05:22

This is my answer using split. The regexp used can be improved, I'm very bad at that sort of thing :-)

(\d+) captures the leading digits (the mass number), and ([A-Za-z]+) captures the letters that follow (the element symbol).

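import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})
# Raw string avoids invalid-escape warnings for \d.
# Columns 0 and 3 hold the empty strings before and after the match.
result = df['iso'].str.split(r'(\d+)([A-Za-z]+)', expand=True)
result = result.loc[:, [1, 2]].rename(columns={1: 'x', 2: 'y'})
print(result)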

Produces

     x   y
0    4  He
1   16   O
2  197  Au

edited Feb 20 '20 at 05:22 by Jinhua Wang · answered Aug 26 '15 at 15:05 by Romain

This scales nicely. I can easily run this on all 2000+ isotopes in my original DataFrame. Plus I added the created columns to the existing DataFrame. – Paul T. Aug 26 '15 at 15:16
Great, I have just made a small improvement to my answer! I still do not know why there are two additional columns generated. I'm sort of allergic to `regexp`. – Romain Aug 26 '15 at 15:21
@Fei Yuan In reply to your suggested edit: the `expand` parameter is "New in version 0.16.1.", as mentioned in the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html). Maybe that is why the code does not work in your environment. – Romain Aug 26 '15 at 15:58
[`[A-z]` matches more than just ASCII letters.](https://stackoverflow.com/a/29771926/3832970) I changed it to `[A-Za-z]`. – Wiktor Stribiżew May 22 '19 at 09:37
This scales very well, is fast, and is the cleanest way to do it. – msarafzadeh Oct 03 '19 at 08:57
The additional columns are generated because split splits on the pattern match and also returns what's before and after it. Probably better to use [extract](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html#pandas.Series.str.extract). – Simon Jun 06 '23 at 07:31
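
The explanation in the last comment can be checked with a minimal sketch that uses Python's built-in re module instead of pandas (that swap is an assumption made here to keep the example small). When the pattern contains capture groups, re.split returns the captured text together with whatever lies before and after the match. For these isotope strings both of those pieces are empty.

```python
import re

pattern = r'(\d+)([A-Za-z]+)'

# The whole string matches, so the pieces before and after the match are empty.
parts = re.split(pattern, '197Au')
print(parts)  # ['', '197', 'Au', '']

# Keeping only the two captured groups drops the empty edges.
num, sym = parts[1], parts[2]
```

Those two empty edge pieces correspond to columns 0 and 3 in the expanded split result, which is why the answer keeps only columns 1 and 2.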

score 1 · Answer 2 · answered Jun 06 '23 at 07:28

The accepted answer pointed me in the right direction, but I think the right pandas function to use is extract. This way only the matched groups are returned, eliminating the need to slice afterwards.

import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})
# Each capture group becomes one column of the result.
df[['num', 'element']] = df['iso'].str.extract(r'(\d+)([A-Za-z]+)', expand=True)
print(df)

gives

     iso  num element
0    4He    4      He
1    16O   16       O
2  197Au  197      Au
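
To connect this back to the question, here is a rough sketch of how the two extracted columns could be turned into the superscript labels. The column name `label` is made up for illustration, and the plot coordinates are left out because they are not shown in the question.

```python
import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})
df[['num', 'element']] = df['iso'].str.extract(r'(\d+)([A-Za-z]+)', expand=True)

# Build mathtext labels such as '$^{4}$He'; 'label' is a column name chosen here.
df['label'] = '$^{' + df['num'] + '}$' + df['element']

labels = df['label'].tolist()
print(labels)  # ['$^{4}$He', '$^{16}$O', '$^{197}$Au']
```

Each label can then be passed to axis.annotate() along with that isotope's own coordinates.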

score 0 · Answer 3 · answered Aug 26 '15 at 14:41

I'd use simple string manipulation, without the hassle of regex.

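isotopes = ['4He', '16O', '197Au']

def get_num(isotope):
    # filter() returns an iterator in Python 3, so join the digits back into a string
    return ''.join(filter(str.isdigit, isotope))

def get_sym(isotope):
    # Whatever remains after removing the digits is the element symbol
    return isotope.replace(get_num(isotope), '')

def get_num_sym(isotope):
    return (get_num(isotope), get_sym(isotope))


for isotope in isotopes:
    num, sym = get_num_sym(isotope)
    print(num, sym)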

score 0 · Answer 4 · answered Aug 26 '15 at 14:42

To extract the number and the element of an isotope symbol you can use a regular expression (short: regex) in combination with Python's re module. The regex looks for digits followed by letters; both parts are captured in named groups and can be accessed by the group's name. If the regex matches, you can extract the data and .format() the desired annotation string:

#!/usr/bin/env python3
# coding: utf-8

import re

iso_num = '16O'

preg = re.compile('^(?P<num>[0-9]*)(?P<element>[A-Za-z]*)$')
m = preg.match(iso_num)

if m:
    num = m.group('num')
    element = m.group('element')

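    # Doubled braces are literal braces in str.format(), giving '$^{16}$O'
    # so that the whole number is superscripted, not just its first digit.
    note = '$^{{{}}}${}'.format(num, element)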

    # axis.annotate(note, xy=(x, y), xycoords='data')
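
The same named-group idea can also be applied to a whole pandas column. Below is a minimal sketch, assuming the data sits in a DataFrame column called `iso` as in the question. It uses `+` instead of `*` in the pattern so that an empty number or symbol does not match, which is a choice made here rather than part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})

# Named groups become the column names of the returned DataFrame.
parts = df['iso'].str.extract(r'^(?P<num>[0-9]+)(?P<element>[A-Za-z]+)$')
print(parts.columns.tolist())  # ['num', 'element']
```

Rows that do not match the pattern come back as missing values, so they are easy to spot before plotting.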

score 0 · Answer 5 · answered Aug 26 '15 at 14:54

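Did you try strip()? Maybe you can consider this:

import string

for i in list_of_isotopes:
  # .str.strip() returns a Series, so take its single element with .iloc[0]
  Num = df[df.iso==i].iso.str.strip(string.ascii_letters).iloc[0]
  Sym = df[df.iso==i].iso.str.strip(string.digits).iloc[0]
  # Braces make the whole number a superscript, not just its first digit
  axis.annotate('$^{%s}$%s' %(Num, Sym), xy=(x[Num], y[Num]), xycoords='data')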
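
The strip() idea also works on the whole column at once, without the loop. This is only a sketch under the assumption that every entry is digits followed by letters. The column names `num` and `sym` are chosen here for illustration.

```python
import string
import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})

# strip() removes the given characters from both ends of each string,
# so stripping letters leaves the number and stripping digits leaves the symbol.
df['num'] = df['iso'].str.strip(string.ascii_letters)
df['sym'] = df['iso'].str.strip(string.digits)
print(df)
```

Because strip() only trims the ends, an entry with letters between the digits and the symbol would not be separated cleanly. The regex approaches are safer for mixed input.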
