Convert pandas heading string containing chemical name to chemical symbol

Question

I have a macro that converts a cell string that contains a chemical name to just its chemical symbol. e.g. a cell that contains "Silver ICM-MS ug/kg" would become simply "Ag"(the chemical symbol for silver). I am trying to convert this to Python to use in processing of lab data where I would convert wordy column headings in a dataframe to just chemical symbols. Headings that don't contain an element name can be left as they are. My attempt isn't working:

#load dictionary of element names, import pandas and load dummy data
dict_of_elements={"Sulphate":"SO4","Silver":"Ag","Aluminium":"Al","Americium":"Am","Argon":"Ar","Arsenic":"As","Astatine":"At","Gold":"Au","Boron":"B","Barium":"Ba","Beryllium":"Be","Bohrium":"Bh","Bismuth":"Bi","Berkelium":"Bk","Bromine":"Br","Calcium":"Ca","Cadmium":"Cd","Cerium":"Ce","Californium":"Cf","Chlorine":"Cl","Curium":"Cm","Copernicium":"Cn","Cobalt":"Co","Chromium":"Cr","Caesium":"Cs","Copper":"Cu","Dubnium":"Db","Darmstadtium":"Ds","Dysprosium":"Dy","Erbium":"Er","Einsteinium":"Es","Europium":"Eu","Fluorine":"F","Iron":"Fe","Flerovium":"Fl","Fermium":"Fm","Francium":"Fr","Gallium":"Ga","Gadolinium":"Gd","Germanium":"Ge","Hydrogen":"H","Helium":"He","Hafnium":"Hf","Mercury":"Hg","Holmium":"Ho","Hassium":"Hs","Iodine":"I","Indium":"In","Iridium":"Ir","Potassium":"K","Krypton":"Kr","Lanthanum":"La","Lithium":"Li","Lawrencium":"Lr","Lutetium":"Lu","Livermorium":"Lv","Mendelevium":"Md","Magnesium":"Mg","Manganese":"Mn","Molybdenum":"Mo","Meitnerium":"Mt","Nitrogen":"N","Sodium":"Na","Niobium":"Nb","Neodymium":"Nd","Neon":"Ne","Nickel":"Ni","Nobelium":"No","Neptunium":"Np","Oxygen":"O","Osmium":"Os","Phosphorus":"P","Protactinium":"Pa","Lead":"Pb","Palladium":"Pd","Promethium":"Pm","Polonium":"Po","Praseodymium":"Pr","Platinum":"Pt","Plutonium":"Pu","Radium":"Ra","Rubidium":"Rb","Rhenium":"Re","Rutherfordium":"Rf","Roentgenium":"Rg","Rhodium":"Rh","Radon":"Rn","Ruthenium":"Ru","Sulfur":"S","Antimony":"Sb","Scandium":"Sc","Selenium":"Se","Seaborgium":"Sg","Silicon":"Si","Samarium":"Sm","Tin":"Sn","Strontium":"Sr","Tantalum":"Ta","Terbium":"Tb","Technetium":"Tc","Tellurium":"Te","Thorium":"Th","Titanium":"Ti","Thallium":"Tl","Thulium":"Tm","Uranium":"U","Ununoctium":"Uuo","Ununpentium":"Uup","Ununseptium":"Uus","Ununtrium":"Uut","Vanadium":"V","Tungsten":"W","Xenon":"Xe","Yttrium":"Y","Ytterbium":"Yb","Zinc":"Zn"}

import pandas as pd

data = {'Silver ICP-MS': [1,2,3], 'Aluminium(III)': [4,5,6], 'Sulphate-LECO': [7,8,9], 'NOT a match': [10,11,12]}
df = pd.DataFrame(data)

df

Out[1]:
   Aluminium(III)  NOT a match  Silver ICP-MS  Sulphate-LECO
0               4           10              1              7
1               5           11              2              8
2               6           12              3              9

I have tried variations along the lines of:

element_cols=[key for key in dict_of_elemets.keys()in df.columns]

print(element_cols)

but I'm still getting to grips with list comprehensions and it's not working.

Or I can loop through and get the matches in a list and then feed that into df.columns:

cols=[]
for i in df.columns:
    for key, value in dict_of_elements.items():
         i= i.replace(key, value)
    cols.append(i)
df.columns=cols    
df

which gives:

Out[3]:
   Al(III)  NOT a match  Ag ICP-MS  SO4-LECO
0        4           10          1         7
1        5           11          2         8
2        6           12          3         9

But the entire header should be replaced with the symbol "Ag, SO4...".

there is no sulphate in ur dict_of_elements dictionary – sammywemmy May 13 '20 at 03:27 — sammywemmy, May 13 '20 at 03:27
@sammywemmy well spotted, fixed. – flashliquid May 13 '20 at 03:29 — flashliquid, May 13 '20 at 03:29

flashliquid · Accepted Answer · 2021-01-04T07:03:23.280

This seems to work, plus I understand it.

Dummy data:

import pandas as pd

data = {'Silver ICP-MS': [1,2,3], 'Aluminium(III)': [4,5,6], 'Sulphate-LECO': [7,8,9], 'NOT a match': [10,11,12]}
df = pd.DataFrame(data)
print ('starting data')
display(df)

Code:

dict_of_elements={"Silver":"Ag","Aluminium":"Al","Americium":"Am","Argon":"Ar","Arsenic":"As","Astatine":"At","Gold":"Au","Boron":"B","Barium":"Ba","Beryllium":"Be","Bohrium":"Bh","Bismuth":"Bi","Berkelium":"Bk","Bromine":"Br","Calcium":"Ca","Cadmium":"Cd","Cerium":"Ce","Californium":"Cf","Chlorine":"Cl","Curium":"Cm","Copernicium":"Cn","Cobalt":"Co","Chromium":"Cr","Caesium":"Cs","Copper":"Cu","Dubnium":"Db","Darmstadtium":"Ds","Dysprosium":"Dy","Erbium":"Er","Einsteinium":"Es","Europium":"Eu","Fluorine":"F","Iron":"Fe","Flerovium":"Fl","Fermium":"Fm","Francium":"Fr","Gallium":"Ga","Gadolinium":"Gd","Germanium":"Ge","Hydrogen":"H","Helium":"He","Hafnium":"Hf","Mercury":"Hg","Holmium":"Ho","Hassium":"Hs","Iodine":"I","Indium":"In","Iridium":"Ir","Potassium":"K","Krypton":"Kr","Lanthanum":"La","Lithium":"Li","Lawrencium":"Lr","Lutetium":"Lu","Livermorium":"Lv","Mendelevium":"Md","Magnesium":"Mg","Manganese":"Mn","Molybdenum":"Mo","Meitnerium":"Mt","Nitrogen":"N","Sodium":"Na","Niobium":"Nb","Neodymium":"Nd","Neon":"Ne","Nickel":"Ni","Nobelium":"No","Neptunium":"Np","Oxygen":"O","Osmium":"Os","Phosphorus":"P","Protactinium":"Pa","Lead":"Pb","Palladium":"Pd","Promethium":"Pm","Polonium":"Po","Praseodymium":"Pr","Platinum":"Pt","Plutonium":"Pu","Radium":"Ra","Rubidium":"Rb","Rhenium":"Re","Rutherfordium":"Rf","Roentgenium":"Rg","Rhodium":"Rh","Radon":"Rn","Ruthenium":"Ru","Sulfur":"S","Antimony":"Sb","Scandium":"Sc","Selenium":"Se","Seaborgium":"Sg","Silicon":"Si","Samarium":"Sm","Tin":"Sn","Strontium":"Sr","Tantalum":"Ta","Terbium":"Tb","Technetium":"Tc","Tellurium":"Te","Thorium":"Th","Titanium":"Ti","Thallium":"Tl","Thulium":"Tm","Uranium":"U","Ununoctium":"Uuo","Ununpentium":"Uup","Ununseptium":"Uus","Ununtrium":"Uut","Vanadium":"V","Tungsten":"W","Xenon":"Xe","Yttrium":"Y","Ytterbium":"Yb","Zinc":"Zn", 'Sulphate':"SO4"}


cols=[]
for i in df.columns:
    for key, value in dict_of_elements.items():
        if key in i:
            i= dict_of_elements[key]
       
    cols.append(i)

df.columns=cols  
print ("processed data")
df

Gives:

starting data

   Aluminium(III)  NOT a match  Silver ICP-MS  Sulphate-LECO
0               4           10              1              7
1               5           11              2              8
2               6           12              3              9


processed data
Out[4]:
   Al  NOT a match  Ag  SO4
0   4           10   1    7
1   5           11   2    8
2   6           12   3    9

I will make the dictionary keys all uppercase and change to if key in i.upper() to capture different capitalisation.

sammywemmy · Answer 2 · 2020-05-13T03:38:17.230

0

This is one way to do it:

create a map from the cartesian of the columns and the dicts, then use the replace method to create ur updated column names :

from itertools import product
mapper = mapper = {k:s for k,(v,s) in 
                   product(df.columns,dict_of_elements.items())
                   if v in k}

mapper
{'Silver ICP-MS': 'Ag', 'Aluminium(III)': 'Al', 'Sulphate-LECO': 'SO4'}


df.columns =  df.columns.to_series().replace(mapper)
       Ag         Al      SO4           NOT a match
0      1           4       7              10
1      2           5       8              11
2      3           6       9              12

edited May 13 '20 at 03:38

answered May 13 '20 at 03:17

sammywemmy

27,093
4
17
31

Thanks Sammy, the header should consist of only the chem symbol. i.e. SO4, Ag, Al, NOT a match. – flashliquid May 13 '20 at 03:32
oh, replace the entire header? ok – sammywemmy May 13 '20 at 03:33

CypherX · Answer 3 · 2020-05-13T05:08:20.617

Use regex library.

A. Replacing the whole header

import pandas as pd
import re

headers = df.columns.to_list()
reference = dict_of_elements
pattern = '|'.join(sorted(reference.keys()))
pattern = re.compile(pattern)

result = []
for e in headers:
    ee = re.findall(pattern, e)
    if len(ee)==0:
        result.append(e)
    else:
        result.append(reference.get(''.join(ee).strip()))
result
df.columns = result
df

B. Replacing specific Chemical's name(not the whole header)

If you want to remove only the specific key by it's value: Silver by Ag

import pandas as pd
import re

text = '\n'.join(df.columns.to_list())
reference = dict_of_elements
pattern = '|'.join(sorted(reference.keys()))
pattern = re.compile(pattern)
res = re.sub(pattern, lambda m: reference[m.group()], text)
df.columns = res.split('\n')
df

Dummy Data

As provided by the OP.

dict_of_elements={"Sulphate":"SO4","Silver":"Ag","Aluminium":"Al","Americium":"Am","Argon":"Ar","Arsenic":"As","Astatine":"At","Gold":"Au","Boron":"B","Barium":"Ba","Beryllium":"Be","Bohrium":"Bh","Bismuth":"Bi","Berkelium":"Bk","Bromine":"Br","Calcium":"Ca","Cadmium":"Cd","Cerium":"Ce","Californium":"Cf","Chlorine":"Cl","Curium":"Cm","Copernicium":"Cn","Cobalt":"Co","Chromium":"Cr","Caesium":"Cs","Copper":"Cu","Dubnium":"Db","Darmstadtium":"Ds","Dysprosium":"Dy","Erbium":"Er","Einsteinium":"Es","Europium":"Eu","Fluorine":"F","Iron":"Fe","Flerovium":"Fl","Fermium":"Fm","Francium":"Fr","Gallium":"Ga","Gadolinium":"Gd","Germanium":"Ge","Hydrogen":"H","Helium":"He","Hafnium":"Hf","Mercury":"Hg","Holmium":"Ho","Hassium":"Hs","Iodine":"I","Indium":"In","Iridium":"Ir","Potassium":"K","Krypton":"Kr","Lanthanum":"La","Lithium":"Li","Lawrencium":"Lr","Lutetium":"Lu","Livermorium":"Lv","Mendelevium":"Md","Magnesium":"Mg","Manganese":"Mn","Molybdenum":"Mo","Meitnerium":"Mt","Nitrogen":"N","Sodium":"Na","Niobium":"Nb","Neodymium":"Nd","Neon":"Ne","Nickel":"Ni","Nobelium":"No","Neptunium":"Np","Oxygen":"O","Osmium":"Os","Phosphorus":"P","Protactinium":"Pa","Lead":"Pb","Palladium":"Pd","Promethium":"Pm","Polonium":"Po","Praseodymium":"Pr","Platinum":"Pt","Plutonium":"Pu","Radium":"Ra","Rubidium":"Rb","Rhenium":"Re","Rutherfordium":"Rf","Roentgenium":"Rg","Rhodium":"Rh","Radon":"Rn","Ruthenium":"Ru","Sulfur":"S","Antimony":"Sb","Scandium":"Sc","Selenium":"Se","Seaborgium":"Sg","Silicon":"Si","Samarium":"Sm","Tin":"Sn","Strontium":"Sr","Tantalum":"Ta","Terbium":"Tb","Technetium":"Tc","Tellurium":"Te","Thorium":"Th","Titanium":"Ti","Thallium":"Tl","Thulium":"Tm","Uranium":"U","Ununoctium":"Uuo","Ununpentium":"Uup","Ununseptium":"Uus","Ununtrium":"Uut","Vanadium":"V","Tungsten":"W","Xenon":"Xe","Yttrium":"Y","Ytterbium":"Yb","Zinc":"Zn"}

data = {'Silver ICP-MS': [1,2,3], 
        'Aluminium(III)': [4,5,6], 
        'Sulphate-LECO': [7,8,9], 
        'NOT a match': [10,11,12]}
df = pd.DataFrame(data)

References

can you write a str.replace() using dictionary values in Python?

Thanks, the header result should only contain the symbol and none of the other text. — flashliquid, May 13 '20 at 04:23
Interesting that in the case of Sulphate it has replaced the whole header (Section B) when it should have retained the "-LECO" part. — flashliquid, May 13 '20 at 07:00

Convert pandas heading string containing chemical name to chemical symbol

3 Answers3

A. Replacing the whole header

B. Replacing specific Chemical's name(not the whole header)

Dummy Data

References