5

I'm trying to make a system to factorise (if that term is correct in Chemistry) a given expanded chemical formula, such as C6H2NO2NO2NO2CH3 into brackets so that it is C₆H₂(NO₂)₃CH₃. (Ignore the subscript, or lack thereof in the first instance). Problem is, I don't know what the repeated molecule is going to be, or even how long it will be. How would I find and count repeats?

For context, here's my code so far which generates the formula from a 2D list of elements:

private String getFormula(List<List<Element>> elements)
{
    String formula = ""; //TODO Switch out for StringBuilder

    for(List<Element> currentElement : elements)
    {
        formula += currentElement.get(0).getSymbol(); //Every element per list is identical, so looking at 0 will always be safe
        if(currentElement.size() > 1) formula += currentElement.size(); //Only display a number if there is more than 1 element
    }

    return formula;
}
Jake Stanger
  • 449
  • 1
  • 8
  • 24
  • So is the input `elements` already separated into C6, H2, NO2, or are you trying to parse the String `C6H2NO2NO2NO2CH3`? What is an example of the List>? Where is the definition of Element -- does it have a .equals() method for example? And it has been a while, but NO2 isn't an element, it's compound, non? In essence, I would probably create a TreeMap (or LinkedHashMap) to preserve order, traverse the entire input, and then traverse the Map adding parenthesis and count around the element if the value was > 1. – KevinO Mar 29 '16 at 19:40
  • elements is already separated, yes. I'm trying to parse the string, which this code generates. NO2 is indeed a compound, so the work done adding brackets must purely use the string. The problem with a map is that it cannot have duplicate keys, and I must also store this to disk using a standard map (Minecraft's NBT system). It could work for a temporary conversion nonetheless... – Jake Stanger Mar 29 '16 at 19:45
  • I think it would be helpful if you would provide more examples of expected input and output. In one comment below you mentioned use case `H3B3U7H3B3U7H3B3U7H3NB3O2U7U7H3H3B3B3U7U7H3H3B3B3U7U7H3C6H2NO2NO2NO2CH3H3B3B3U7‌​`, which should output `C6H2(NO2)3CH3(U7H3B3)10`. This example violated a lot of assumptions I had made for this problem. For example, the substring `U7H3B3` isn't always exactly in that order in the original, nor are there 10 copies of it in that exact form. This exponentially complicates the problem. – mellamokb Mar 29 '16 at 20:56
  • In other words, you need to design a pseudo-code algorithm *first* that illustrates how you would handle this example long before you write the first line of code. We need to back up and evaluate the scope of possible inputs and outputs and how they should be interpreted. – mellamokb Mar 29 '16 at 20:58
  • 2
    @mellamokb Ignore that example, basically. It seems that the input generated from the elements I put in was muddled, so the output was extra muddled. I'll produce some more examples and edit my original post in a minute... – Jake Stanger Mar 29 '16 at 20:59

5 Answers5

2

This answer gives the code: String result = source.replaceAll("(.+)\\1+", "$1"), which replaces all repeated substrings.

I'm thinking that a slight modification of that code should do what you want. If you use "($1)" as the replacement, it will wrap the match in parenthesis. You could probably step through the replacement and determine what number should come after the parenthesis.

To prevent the regex from capturing preceding numbers, try "([A-Za-z]+[1-9]*)\\1+".

This link explains how to count number of matches. It's a bit more complex:

 Pattern pattern = Pattern.compile("([A-Za-z]+[1-9]*)\\1+");
    Matcher  matcher = pattern.matcher(YOUR_CHEM_STRING);

   int count = 0;
   String prior="";
    while (matcher.find()){
       if(m.group().equals(prior){
             count++;
       }else{
           YOUR_CHEM_STRING.replaceAll("([A-Za-z]+[1-9]*)\\1+","($1)"+count);
           count=0;
       }
    }
Community
  • 1
  • 1
Laurel
  • 5,965
  • 14
  • 31
  • 57
  • This has had a strange effect of putting the the brackets around the `NO` so that I get `C6H(2NO)2CH3`. Any idea what to do here (I've not really learnt how to use regexes properly yet)? I haven't counted the number of repeats yet either. – Jake Stanger Mar 29 '16 at 19:51
  • The regex sees `2NO2NO` and acts accordingly. I'll see about making it recognize there are letters then numbers. – Laurel Mar 29 '16 at 19:53
  • Works beautify. Accepted due to simplicity versus other answers. – Jake Stanger Mar 29 '16 at 19:57
  • Is there an easy way to count the number of repeats, otherwise I'm afraid I'll have to choose another answer (ooh, that sounds almost blackmail-y :P) – Jake Stanger Mar 29 '16 at 20:05
2

EDIT Updated so that it considers order. Thanks @JakeStanger!
2nd EDIT Updated to reflect new condition where molecule ends with |


I used a regular expression to split by | since from the given String we know that a new molecule starts after |. I used a Hashmap to keep track of how many molecules of each type were given. In the end I iterated through each value in the Hashmap and appended to String result depending on whether it was one molecule or not. Cheers!

   public static String factorise(String input) {
        String result = "";
        Map<String, Integer> molecules = new LinkedHashMap<>();
        String[] res = input.split("\\|");
        for (String t : res) {
            //Check if we already have this element in our map
            if (!molecules.containsKey(t)) {
                //If not then add it and set the count to 1
                molecules.put(t, 1);
            } else {
                //If we do then update the count by 1
                molecules.put(t, molecules.get(t) + 1);
            }
        }
        //Iterate through each molecule
        for (String key : molecules.keySet()) {
            if (molecules.get(key) == 1) {
                //If the count is only at one, then we just need to append it.
                result += key;
            } else {
                //Otherwise, we need the parentheces and the number of repetitions followed after
                result = result + "(" + key + ")" + molecules.get(key);
            }
        }
        return result;
    }

Running

    System.out.println(factorise("C6|H2|NO2|NO2|NO2|CH3|OH|OH"));
    System.out.println(factorise("HO|HO"));

Yields the following when run:

run:
C6H2(NO2)3CH3(OH)2
(HO)2
BUILD SUCCESSFUL (total time: 0 seconds)

robotlos
  • 536
  • 2
  • 10
  • Order is important, but a `LinkedHashMap` should sort that out. – Jake Stanger Mar 29 '16 at 19:55
  • @JakeStanger Thank you for the suggestion! Will incorporate it. – robotlos Mar 29 '16 at 19:57
  • This works very well. I'm having a few issues with compounds containing multiple repeats, but this appears to be an issue on my part with the actual compound creation system. – Jake Stanger Mar 29 '16 at 20:09
  • This one suffers from similar problem as the other answers, it doesn't handle repeated molecules that don't have numbers, or have numbers in the middle. For example `CaOHOH` or `C6H2OH2ONO` (I realize this may not be a valid molecule, just demonstrates the concept). – mellamokb Mar 29 '16 at 20:12
  • @mellamokb Hmmm, valid point. I'm not too chemistry savvy, but then at this point, how do we distinguish OH from O1H1 or OH1? Is this a valid question? Or better yet, OHOH from O2H2 and OH2. – robotlos Mar 29 '16 at 20:15
  • @mellamokb Well that explains the `(H3)10(B3)9(U7)10NB3O2C6H2(NO2)3CH3` I've got here then :P When you say it doesn't handle them, do you mean it ignores them or causes unwanted effects? – Jake Stanger Mar 29 '16 at 20:18
  • @robotlos, technically you should not have O1H1 or OH1, as 1 is not used (if I recall correctly). However, the CaOHOH is a valid example. As is the would be the example with the two water molecules. – KevinO Mar 29 '16 at 20:19
  • @JakeStanger What was your input that resulted on `(H3)10(B3)9(U7)10NB3O2C6H2(NO2)3CH3`, and what was the expected output? Let me see if I can come up with something. – robotlos Mar 29 '16 at 20:21
  • @JakeStanger, the input to the current code of "CaOHOH" results in an output of "CaOHOH" rather than "Ca(OH)2", so ignores seems appropriate. In essence, I think the issue is the assumption of a number as the diving point for an element/molecule in the String. BTW, is there no ability to modify the `getFormula(List>` to output the String correctly in the first place? At one point, all of the needed information was present, but parsing the String may be difficult. – KevinO Mar 29 '16 at 20:22
  • @JakeStanger If you could append a `1` at the end of each molecule that doesn't have a number that would solve it easily. Not sure if that's acceptable though. – robotlos Mar 29 '16 at 20:24
  • @robotlos the rather nasty `H3B3U7H3B3U7H3B3U7H3NB3O2U7U7H3H3B3B3U7U7H3H3B3B3U7U7H3C6H2NO2NO2NO2CH3H3B3B3U7` went in. I expected `C6H2(NO2)3CH3(U7H3B3)10`. Give me a second and I'll try running a more practical test :/ I think the actual compound generation is off though, looking at the input, which has lost its order... – Jake Stanger Mar 29 '16 at 20:24
  • @robotlos There's no reason why I can't add a `1` when factorising, then remove all `1`s afterwards – Jake Stanger Mar 29 '16 at 20:26
  • @JakeStanger You wouldn't have to remove the `1`s after factorizing since the method I wrote won't append `1`s. If you could feed the method I wrote with the molecules containing a `1` where it's needed then it will solve the issue I believe! – robotlos Mar 29 '16 at 20:28
  • @robotlos oh I see what you mean, give me a second to try that. EDIT: Expected `(OH)2`, got `(O1)2(H1)2` – Jake Stanger Mar 29 '16 at 20:29
  • @JakeStanger wait, just tried it and it does append `1`s. Let me see what I can do. – robotlos Mar 29 '16 at 20:30
  • @JakeStanger Okay, I added an `if` condition that will check if the molecule ends with `1` and will remove it only if the molecule's second to last `Character` is not a number. So in the case that you get something like A11 it'll treat it as a valid input. So this will work if you add a `1` before factorising. Cheers! – robotlos Mar 29 '16 at 20:41
  • @robotlos I tried what should be `(OH)2` again but got `OHOH` – Jake Stanger Mar 29 '16 at 20:45
  • @JakeStanger Was your input `OH1OH1`? Make sure you get the most updated version. I had to make a minor change. Running `OH1OH1` resulted in `(OH)2` for me. – robotlos Mar 29 '16 at 20:46
  • @robotlos bugger! You're going to hate me for this! What you've done is append a 1 to the molecule, but I can't do that because I don't know where the molecule starts and ends. I interpreted it as appending 1 to every *element*. Sorry 'bout that :( EDIT: Shall we move this conversation to chat before it gets any longer? – Jake Stanger Mar 29 '16 at 20:49
  • @JakeStanger Hmm, okay, that's the main issue with all the regexes here. It's hard to tell when a molecule starts and ends. Perhaps finding a way that will indicate where each molecule ends would really help out (if you're allowed to do this). – robotlos Mar 29 '16 at 20:55
  • @robotlos I could add a special 'element' with a mass of 0 and use it as a marker I suppose. – Jake Stanger Mar 29 '16 at 21:01
  • @JakeStanger So `OHOH` would look like `OH0OH0`? Would something like `A10` be valid? – robotlos Mar 29 '16 at 21:01
1

You count split up the formula elements into a list and then parse it, counting consecutive repeats. When the count is greater than 1 you will add it with parenthesis.

String formula = "C6H2NO2NO2NO2CH3";
Pattern p = Pattern.compile("[a-zA-Z]+[0-9]+");
Matcher m = p.matcher(formula);
List<String> parts = new ArrayList<String>();
while(m.find()) {
    parts.add(m.group());
}

String shrink = "";
int count = 0;
for(int i=0; i<parts.size(); i++) {
    count++;
    if(i+1 == parts.size() || !parts.get(i+1).equals(parts.get(i))) {
        if(count == 1)
            shrink += parts.get(i);
        else
            shrink += "("+parts.get(i)+")"+count;
        count = 0;
    }
}
System.out.println(shrink); // result = "C6H2(NO2)3CH3"

If you are able to send the elements list, try this:

public static String shortForumla(List<List<Element>> elements) {
    String shrink = "";
    int count = 0;
    for(int i=0; i<elements.size(); i++) {
        String symbol = elements.get(i).get(0).symbol();
        if(i+1 == elements.size() || !elements.get(i+1).get(0).symbol().equals(symbol)) {
            if(count == 1)
                shrink += symbol;
            else
                shrink += "("+symbol+")"+count;
            count = 0;
        }
    }
    return shrink;
}
Matthew Wright
  • 1,485
  • 1
  • 14
  • 38
  • 1
    I do not think this regex will work for other cases. Consider the input CaOHOH which should yield Ca(OH)2. The regex appears to assume a count after the element, which may be incorrect. – KevinO Mar 29 '16 at 19:53
  • In that case, would you be able to send the `List> elements` to another method instead of using the plain formula string? I've updated my answer if you can. – Matthew Wright Mar 29 '16 at 20:03
  • It would be possible, yes. Does this do what my code did as well then, and actually generate the whole formula in one? – Jake Stanger Mar 29 '16 at 20:19
  • It should but would need to changed a little because I just realized it doesn't include the element counts in the symbols. – Matthew Wright Mar 29 '16 at 20:22
  • I think my original answer would work if you add the `1`s to each element and removed them afterwards as suggested in robotlos answer. – Matthew Wright Mar 29 '16 at 20:31
1

Here is an alternative solution that uses simple string parsing to find matching repeated substrings.

public static String factorise(String input) {

    StringBuilder result = new StringBuilder();

    for (int start = 0; start < input.length(); start++) {
        char c = input.charAt(start);
        if (c >= '0' && c <= '9') {
            result.append(c);
            continue;
        }

        boolean foundRepeat = false;
        for (int end = start + 1; end <= input.length(); end++) {
            int length = end - start;
            if (end + length > input.length()) break;

            String sub = input.substring(start, end);
            String nextsub = input.substring(end, end + length);
            int nextpos = end + length;
            int count = 1;

            while (sub.equals(nextsub)) {
                count++;
                if (nextpos + length > input.length()) break;

                nextsub = input.substring(nextpos, nextpos + length);
                nextpos += length;
            }

            if (count > 1) {
                result.append("(" + sub + ")" + count);
                start += length * (count) - 1;
                foundRepeat = true;
                break;
            }
        }

        if (!foundRepeat) {
            result.append(c);
        }
    }

    return result.toString();
}

Examples:

  • Input: Output
  • C6H2NO2NO2NO2CH3: C6H2(NO2)3CH3
  • CaOHOH: Ca(OH)2
  • C6H2OH2ONO: C6(H2O)2NO
  • C6H2NO2NO2NO2CH3OHOH: C6H2(NO2)3CH3(OH)2
mellamokb
  • 56,094
  • 12
  • 110
  • 136
0

Here is a another solution using Map & ArrayList instead of regex. I inserted every molecule in a Map and incremented its key value every time it appeared in the formula and a ArrayList to maintain order of molecules.

public static void main(String [] audi){
        String s="C66H2NO2NO2NO2CH3";
        Map<String, Integer > Formula=new HashMap<String, Integer >();
        List E = new ArrayList();

        String t="";int c=0,l=s.length();
        for(int i=0;i<l;i++)
        {
            t=t.concat(""+s.charAt(i));
            if(Character.isDigit(s.charAt(i)))
            {
                if(((i+1)<l && !Character.isDigit(s.charAt(i+1)) )|| i==l-1)
                {
                    int count = Formula.containsKey(t) ? Formula.get(t) : 0;
                    Formula.put(t, count + 1);
                    if(!E.contains(t))
                    E.add(t);
                    t="";
                }
            }
        }
        //display
        for(int i=0;i<E.size();i++){
            if(Formula.get(E.get(i))>1)
                System.out.print("("+E.get(i) + ")"+Formula.get(E.get(i)));
            else
                System.out.print(E.get(i));
        }
}

Output: C6H2(NO2)3CH3

Rajeev Singh
  • 3,292
  • 2
  • 19
  • 30