2

I am making a program that lets a user input a chemical formula, for example C9H11N02. When they enter that, I want to split it up into pieces so I have C9, H11, N, 02. Once it is split like this, I want to make changes to it so I can make it C10H12N203 and then put it back together. This is what I have done so far. Using the regular expression below I can extract the integer values, but how would I go about getting C10, H11, etc.?

System.out.println("Enter Data");

Scanner k = new Scanner( System.in );
String input = k.nextLine();

String reg = "\\s\\s\\s";
String [] data;

data = input.split( reg );

int m = Integer.parseInt( data[0] );
int n = Integer.parseInt( data[1] );
tckmn
Joe24
  • @BheshGurung don't be so sure... – Bohemian Nov 11 '12 at 22:43
  • I don't understand, what string are you trying to split with a space as delimiter? – PermGenError Nov 11 '12 at 22:44
  • You can do this in JavaScript by calling a function from the regex - see http://stackoverflow.com/questions/1742798/increment-a-number-in-a-string-in-with-regex - but this is Java... – DNA Nov 11 '12 at 22:45
  • @BheshGurung See? You were wrong. It's *easily* done with regex – Bohemian Nov 12 '12 at 01:08
  • @BheshGurung See my answer for the solution and some test code that demonstrates splitting correctly even with multi-letter chemical symbols like `Br` and with multi-digit numbers. – Bohemian Nov 12 '12 at 01:28

3 Answers

4

It can be done using lookarounds:

String[] parts = input.split("(?<=.)(?=[A-Z])");

Lookarounds are zero-width, non-consuming assertions: they test the text around a position without removing any characters from the result.

This regex splits the input at every position where both lookarounds match:

  • (?<=.) means "there is a preceding character" (i.e., not at the start of input)
  • (?=[A-Z]) means "the next character is a capital letter" (all element symbols start with A-Z)

Here's a test, including a double-character symbol for some edge cases:

public static void main(String[] args) {
    String input = "C9KrBr2H11NO2";
    String[] parts = input.split("(?<=.)(?=[A-Z])");
    System.out.println(Arrays.toString(parts));
}

Output:

[C9, Kr, Br2, H11, N, O2]

If you then want to split up the individual components, use a nested call to split():

public static void main(String[] args) {
    String input = "C9KrBr2H11NO2";
    for (String component : input.split("(?<=.)(?=[A-Z])")) {
        // split on non-digit/digit boundary
        String[] symbolAndNumber = component.split("(?<!\\d)(?=\\d)");
        String element = symbolAndNumber[0];
        // elements without numbers won't be split
        String count = symbolAndNumber.length == 1 ? "1" : symbolAndNumber[1];
        System.out.println(element + " x " + count);
    }
}

Output:

C x 9
Kr x 1
Br x 2
H x 11
N x 1
O x 2
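
If the split runs many times, both regexes can be compiled once and reused instead of being recompiled on every call to String.split(). The following is a minimal sketch under that assumption; the class name FormulaSplit, the method parse, and the choice of a LinkedHashMap (to keep elements in input order) are illustrative and not part of the code above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class FormulaSplit {
    // Compiled once and reused for every formula
    private static final Pattern COMPONENT_BOUNDARY = Pattern.compile("(?<=.)(?=[A-Z])");
    private static final Pattern DIGIT_BOUNDARY = Pattern.compile("(?<!\\d)(?=\\d)");

    // Hypothetical helper: maps each element symbol to its count, in input order
    static Map<String, Integer> parse(String formula) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String component : COMPONENT_BOUNDARY.split(formula)) {
            String[] symbolAndNumber = DIGIT_BOUNDARY.split(component);
            // A component without a number counts as 1
            int count = symbolAndNumber.length == 1 ? 1 : Integer.parseInt(symbolAndNumber[1]);
            // Sum the counts if the same element appears more than once
            counts.merge(symbolAndNumber[0], count, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(parse("C9KrBr2H11NO2")); // {C=9, Kr=1, Br=2, H=11, N=1, O=2}
    }
}
```

Because merge() sums the counts, a formula that repeats an element, such as CH3CH2OH, ends up with one entry per element.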
Bohemian
  • I deleted my comments. +1 for the proof that my comment was wrong. It's a good solution. – Bhesh Gurung Nov 12 '12 at 01:46
  • Possibly cleaner than my solution, but I'd be interested to see if there's any difference performance-wise... Also you might want to use a `Pattern` so you don't have to recompile the regex every time. – jrtc27 Nov 12 '12 at 07:44
2

Did you accidentally put zeroes into some of those formulas where the letter "O" (oxygen) was supposed to be? If so:

"C10H12N2O3".split("(?<=[0-9A-Za-z])(?=[A-Z])");

[C10, H12, N2, O3]

"CH2BrCl".split("(?<=[0-9A-Za-z])(?=[A-Z])");

[C, H2, Br, Cl]
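
To change the number inside a single token (for example C10 to C11), the token itself needs a second regex that separates the symbol from the digits. A minimal sketch, assuming a hypothetical helper named bump and assuming the adjusted count stays at 1 or above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenBump {
    // Symbol (one capital, optional lowercase letters) followed by an optional count
    private static final Pattern TOKEN = Pattern.compile("([A-Z][a-z]*)(\\d*)");

    // Hypothetical helper: adds delta to the count of a token such as "C10"
    static String bump(String token, int delta) {
        Matcher m = TOKEN.matcher(token);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an element token: " + token);
        }
        // A missing count means 1
        int count = m.group(2).isEmpty() ? 1 : Integer.parseInt(m.group(2));
        int updated = count + delta;
        if (updated < 1) {
            throw new IllegalArgumentException("Count would drop below 1: " + token);
        }
        // A count of 1 is left implicit, as in the original formulas
        return updated == 1 ? m.group(1) : m.group(1) + updated;
    }

    public static void main(String[] args) {
        System.out.println(bump("C10", 1)); // C11
        System.out.println(bump("N", 1));   // N2
    }
}
```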
Adam Paynter
  • Sorry, I think I did. Once it is extracted like this, could I break it down even further so I could add 1 to C10 to make it C11? – Joe24 Nov 11 '12 at 23:03
  • +1 for the lookBehind - but this doesn't work for some combinations of two-letter chemical symbols e.g. CH2BrCl – DNA Nov 11 '12 at 23:11
  • @DNA: I think it should be fixed now. – Adam Paynter Nov 11 '12 at 23:24
  • 1
    Nice. I think you can simplify to just `(?<=.)(?=[A-Z])` ? – DNA Nov 11 '12 at 23:35
  • @Joe24: In that case, you may want to go with jrtc27's answer. This solution would require a subsequent regular expression to pull the number out of the token. – Adam Paynter Nov 12 '12 at 00:17
1

I believe the following code should allow you to extract the various elements and their associated count. Of course, brackets make things more complicated, but you didn't ask about them!

// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("([A-Z][a-z]*)([0-9]*)");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    String element = matcher.group(1);
    // group(2) is the empty string when the element has no explicit count
    String digits = matcher.group(2);
    int count = digits.isEmpty() ? 1 : Integer.parseInt(digits);
    // Do stuff with this component
}
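
As one possible way to fill in the "do stuff" step and put the formula back together, the sketch below adds a fixed amount to every count and rebuilds the string with a StringBuilder. The class name FormulaRebuild and the method addToEachCount are hypothetical, and the sketch always writes the count explicitly, even when it is 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FormulaRebuild {
    private static final Pattern COMPONENT = Pattern.compile("([A-Z][a-z]*)([0-9]*)");

    // Hypothetical helper: adds delta to every element count and reassembles the formula
    static String addToEachCount(String formula, int delta) {
        Matcher matcher = COMPONENT.matcher(formula);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String digits = matcher.group(2);
            // An element without digits has an implicit count of 1
            int count = digits.isEmpty() ? 1 : Integer.parseInt(digits);
            out.append(matcher.group(1)).append(count + delta);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(addToEachCount("C9H11NO2", 1)); // C10H12N2O3
    }
}
```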
jrtc27
  • That pattern will get the wrong result for CH4, for example - it should return [C, H4] but I think it will return [CH4]. Two-letter chemical symbols are always uppercase-lowercase. – DNA Nov 11 '12 at 23:05