How to Find Character After some Specific String? [Chemistry Code]

Question

I'm trying to write a code where I enter a molecule (only using those that include C, H, O, Cl and N for convenience) for input and get the molecular weight.

Like if I enter C2H4, it should do the calculation:

MW = mass_C * 2 + mass_H *4

I need the code to take the letter(s) before the number, then multiply it with the number following the "string" until it reaches the end.

So basically, how do I get the letters before a number?

PS: I'm new to coding :) so would be nice to see written code instead of just explanation to understand the format I should use for RE's.

Hi. See if this is what you are looking for https://stackoverflow.com/questions/23602175/regex-for-parsing-chemical-formulas — Janitha, Jun 06 '19 at 18:55
Possible duplicate of [Extract Number before a Character in a String Using Python](https://stackoverflow.com/questions/36167442/extract-number-before-a-character-in-a-string-using-python) — m13op22, Jun 06 '19 at 18:55

score 2 · Answer 1 · answered Jun 06 '19 at 19:20

Here is some simple code that does what you need:

import re                                                                      

atomic_weights = {                                                             
    'Cl': 35.446,                                                              
    'C': 12.0096,                                                              
    'O': 15.99903,                                                             
    'H': 1.00784,                                                              
}                                                                              

pattern = r"(?P<element>" + "|".join(atomic_weights.keys()) + ")(?P<count>\d*)" 
expression = re.compile(pattern)                                               

while True:                                                                    
    formula = input("formula: ")                                               
    weight = 0                                                                 
    for match in expression.finditer(formula):                                 
        element = match.group('element')                                          
        count = match.group('count')                                           
        if count.isdigit():                                                    
            count = int(count)                                                 
        else:                                                                  
            count = 1
        weight += atomic_weights[element] * count                          
    print(f"{formula} weighs {weight}")

I'll only focus on the use of regular expressions:

first we compile a pattern using the following special characters:
- | matches everything before or after it
- \d matches everything that is a digit
and the following qualifier:
- * which matches zero or more (we don't necessarily need a number)
and named groups:
- (?P<name>) names whatever is matched so it can easily be referenced to again later

Then we match the formula against the regular expression we created and loop over all the matches to find the corresponding element weight and multiplier.

Little note: this method will completely ignore elements that it doesn't know the weight of. That's probably not what you want...

Hope this helps!

Thank you, I don't know some of the syntax here but I will search it :) — selinoktay, Jun 06 '19 at 19:37
@selinoktay The "join" syntax for the regular expression is particularly useful, since that allows you to just change a line in atomic_weights and nothing else. The 'r' means a raw string, useful for regex. re: this code ignoring elements, it seems like that could be fixed without too much trouble. You could set a check_string=formula, then in the "for" loop, remove each match group. If there is something left, have the program say it didn't catch everything. I think that'd be a good exercise for the original poster. — aschultz, Jun 06 '19 at 20:08
@aschultz thanks for the explanation the "r" kinda bugged me but I get it now. — selinoktay, Jun 07 '19 at 08:27

score 0 · Answer 2 · answered Jun 06 '19 at 19:06

I'm not sure I have the atomic weights right, but I guess that can be fixed.

The tricky bit for me was trying to figure out cases like CH3, where the simple (letters)(numbers) regex doesn't work.

re.findall does the heavy lifting here. There may be a better way to parse the C2H4 string, and I'd be interested in it, but this works. Obviously you can clean things up and make neater functions, etc.

But the regex here, which I suspect you are most interested in, says: look for a string of letters, upper or lower case, then a string of numbers. That is passed to calc_weight, which separates the string into letters and numbers. The letters are sent to an atomic weight, if available. If not, an error is thrown. Then the weight is multiplied by the number.

import re
import sys

weight = { 'cl': 30, 'n': 8, 'o': 12, 'c': 6, 'h': 2 }

def calc_weight(my_str):
    elt = my_str[1].lower()
    if not re.search("[0-9]", my_str[0]): amt = 1
    else: amt = re.sub("^[a-zA-Z]+", "", my_str[0])
    if elt not in weight: sys.exit(elt + " is not a valid element.")
    return int(amt) * weight[elt]

my_string = "C2H4"

a = re.findall("((Cl|H|O|C|N)[0-9]*)", my_string)

my_weight = 0
for b in a:
    my_weight += calc_weight(b)

print("Weight of", my_string, "is", my_weight)

A word on the code: my_str[0] and my_str[1] are part of a tuple from findall, because I have two pairs of parentheses. The first is the overall string, and the second is the element.

Hope this helps. Note you can probably improve on the code: throw a better error message for a bad string, etc. But I wanted to at least allow for capitalization e.g. if someone typed Mg or MG it should not make a difference.

if you want to allow for capitalization, you should have a look at case insensitive regular expressions (https://docs.python.org/3/library/re.html#re.IGNORECASE) — Bart Van Loon, Jun 06 '19 at 19:23
Thanks, the findall function is extremely helpful :) didn't think about using that. — selinoktay, Jun 06 '19 at 19:38

How to Find Character After some Specific String? [Chemistry Code]

2 Answers2