5

The Setup

I have a pandas dataframe that contains a column 'iso' containing chemical isotope symbols, such as '4He', '16O', '197Au'. I want to label many (but not all) isotopes on a plot using the annotate() function in matplotlib. The label format should have the atomic mass in superscript. I can do this with the LaTeX style formatting:

axis.annotate('$^{4}$He', xy=(x, y), xycoords='data')

I could write dozens of annotate() statements like the one above for each isotope I want to label, but I'd rather automate.

The Question

How can I extract the isotope number and name from my iso column?

With those pieces extracted I can make the labels. Let's say we dump them into the variables Num and Sym. Now I can loop over my isotopes and do something like this:

for i in list_of_isotopes:
  (Num, Sym) = df[df.iso==i].iso.str.MISSING_STRING_METHOD(???)
  axis.annotate('$^{%s}$%s' %(Num, Sym), xy=(x[Num], y[Num]), xycoords='data')

Presumably, there is a pandas string method that I can drop into the above, but I'm having trouble coming up with a solution. I've been trying split() and extract() with a few different patterns, but can't get the desired effect.

Paul T.
  • Maybe this can help to split your `iso` column. It will create a column for each token returned by the `split`. Could you provide an example of the data to split and the pattern to match ? `df = pd.DataFrame('part1_part2', index=range(0,3), columns=['iso']) df['iso'].str.split('_', expand=True)`. – Romain Aug 26 '15 at 14:39
  • That would require my column to already have an underscore... which it does not. – Paul T. Aug 26 '15 at 14:43
  • Check my answer using an improvable `regexp` to split the string. – Romain Aug 26 '15 at 15:05

5 Answers

12

This is my answer using split. The regexp can probably be improved; I'm very bad at that sort of thing :-)

(\d+) stands for the integers, and ([A-Za-z]+) stands for the strings.

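import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})
# Raw string avoids invalid-escape warnings for \d; captured groups are kept in the split output
result = df['iso'].str.split(r'(\d+)([A-Za-z]+)', expand=True)
# Keep only the two captured groups (columns 1 and 2)
result = result.loc[:, [1, 2]]
result.rename(columns={1: 'x', 2: 'y'}, inplace=True)
print(result)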

Produces

     x   y
0    4  He
1   16   O
2  197  Au
Jinhua Wang
Romain
  • This scales nicely. I can easily run this on all 2000+ isotopes in my original DataFrame. Plus I added the created columns to the existing DataFrame. – Paul T. Aug 26 '15 at 15:16
  • Great, I have just made a small improvement to my answer! I still do not know why there are two additional columns generated. I'm sort of allergic to `regexp`. – Romain Aug 26 '15 at 15:21
  • @Fei Yuan To reply to your suggested edit: the `expand` parameter is "New in version 0.16.1." as mentioned in the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html). Maybe that is why the code does not work in your environment. – Romain Aug 26 '15 at 15:58
  • [`[A-z]` matches more than just ASCII letters.](https://stackoverflow.com/a/29771926/3832970) I changed it to `[A-Za-z]`. – Wiktor Stribiżew May 22 '19 at 09:37
  • This scales very well, is fast, and is the cleanest way to do it. – msarafzadeh Oct 03 '19 at 08:57
  • The additional columns are generated because split splits on the pattern match and also returns what's before and after it. Probably better to use [extract](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html#pandas.Series.str.extract). – Simon Jun 06 '23 at 07:31
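
To illustrate the last comment about the extra columns, here is a minimal sketch using the standard `re` module, on the assumption (suggested by that comment) that the pandas regex split behaves like `re.split` here. When the pattern contains capturing groups, the result holds the text before the match, each captured group, and the text after the match:

```python
import re

pattern = re.compile(r'(\d+)([A-Za-z]+)')

# The pattern matches the whole string, so the text before and after
# the match is empty; the captured groups sit between those two pieces.
pieces = pattern.split('197Au')
print(pieces)  # ['', '197', 'Au', '']

# Keeping only the captured groups mirrors selecting columns 1 and 2.
num, sym = pieces[1:3]
print(num, sym)  # 197 Au
```

The two empty strings correspond to the two extra columns that the answer slices away.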
1

The accepted answer pointed me in the right direction, but I think the right pandas function to use is extract. That way only the groups matched by the regular expression are returned, eliminating the need to slice afterwards.

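import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})
# One output column per capturing group; the raw string avoids invalid-escape warnings for \d
df[['num', 'element']] = df['iso'].str.extract(r'(\d+)([A-Za-z]+)', expand=True)
print(df)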

gives

     iso  num element
0    4He    4      He
1    16O   16       O
2  197Au  197      Au
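
As a sketch of how the extracted columns could feed the labels asked for in the question, the snippet below builds the LaTeX-style strings with plain string concatenation. The column name `label` is an assumption chosen for illustration; it does not appear in the question.

```python
import pandas as pd

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})
df[['num', 'element']] = df['iso'].str.extract(r'(\d+)([A-Za-z]+)', expand=True)

# 'label' is a hypothetical column name; the braces keep the whole
# mass number in superscript, not just its first digit.
df['label'] = '$^{' + df['num'] + '}$' + df['element']
print(df['label'].tolist())  # ['$^{4}$He', '$^{16}$O', '$^{197}$Au']
```

Each label can then be passed to axis.annotate() together with the coordinates of the corresponding row.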
Simon
0

I'd use simple string manipulation, without the hassle of regex.

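isotopes = ['4He', '16O', '197Au']

def get_num(isotope):
    # filter() returns an iterator in Python 3, so join the digits back into a string
    return ''.join(filter(str.isdigit, isotope))

def get_sym(isotope):
    # Whatever remains after removing the digits is the element symbol
    return isotope.replace(get_num(isotope), '')

def get_num_sym(isotope):
    return (get_num(isotope), get_sym(isotope))


for isotope in isotopes:
    num, sym = get_num_sym(isotope)
    print(num, sym)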
taesu
0

To extract the number and the element from an isotope symbol, you can use a regular expression (regex for short) with Python's re module. The regex matches the digits first and then the letters; both parts are captured in named groups and can be retrieved by group name. If the regex matches, you can extract the data and .format() the desired annotation string:

#!/usr/bin/env python3
# coding: utf-8

import re

iso_num = '16O'

preg = re.compile('^(?P<num>[0-9]*)(?P<element>[A-Za-z]*)$')
m = preg.match(iso_num)

if m:
    num = m.group('num')
    element = m.group('element')

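    # Doubled braces are literal in str.format(), giving '$^{16}$O' so the whole number is superscripted
    note = '$^{{{}}}${}'.format(num, element)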

    # axis.annotate(note, xy=(x, y), xycoords='data')
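
Because the pattern above uses `*`, it also matches strings with no digits or no letters, including the empty string. A stricter variant is sketched below; the helper name `isotope_label` and the constant `ISO_RE` are hypothetical and only serve the example. It requires at least one digit followed by at least one letter and returns None otherwise:

```python
import re

# '+' requires at least one digit and at least one letter.
ISO_RE = re.compile(r'(?P<num>[0-9]+)(?P<element>[A-Za-z]+)')

def isotope_label(iso):
    """Return a LaTeX-style label such as '$^{16}$O', or None if iso does not match."""
    m = ISO_RE.fullmatch(iso)
    if m is None:
        return None
    # Doubled braces produce literal braces around the mass number.
    return '$^{{{}}}${}'.format(m.group('num'), m.group('element'))

for iso in ['4He', '16O', '197Au', 'He']:
    print(iso, isotope_label(iso))
```

Returning None for malformed symbols lets the calling loop skip them instead of drawing a broken label.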
albert
0

Did you try strip()? Maybe you can consider this:

import string

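for i in list_of_isotopes:
  # .str.strip() returns a Series, so take the single matching value with .iloc[0]
  Num = df[df.iso==i].iso.str.strip(string.ascii_letters).iloc[0]
  Sym = df[df.iso==i].iso.str.strip(string.digits).iloc[0]
  # Braces around %s keep multi-digit mass numbers fully superscripted
  axis.annotate('$^{%s}$%s' %(Num, Sym), xy=(x[Num], y[Num]), xycoords='data')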
Fei Yuan