Split a string by period, but string contains float numbers

Question

I have a string made of names (without spaces) separated by periods. Each token can start with a character from [a-zA-Z_], with a [ (in which case it ends with a ]), or with a $ (in which case it ends with a $).

Examples:

House.Car.[0].Flower
House.Car.$something$
House.Car2.$4.45$.[0]
House.Car2.$abc.def$.[0]

So I need to split the string by period, but in the last two examples I don't want to split the 4.45 (or abc.def). Anything surrounded by $ should not be split.

For the last two examples I just want an array like this:

House
Car2
$4.45$ //fixed, thanks Sabuj Hassan
[0]

or

House
Car2
$abc.def$
[0]

I have tried to use regex, but I'm completely wrong.

I was just informed that after the closing $ there could be another string surrounded by < and > which can again contain dots which I should not split:

House.Car.$abc.def$<ghi.jk>.[0].bla

And I need to get it like:

House
Car
$abc.def$<ghi.jk>
[0]
bla

Thanks for your help.

So basically you don't want to split on dot if it is surrounded with digits from the both sides? Or cases like `$a.b$` are also possible? — Pshemo, Apr 06 '14 at 13:15
Actually, I don't want to split anything that is surrounded by $'s. I will update my question. SabujHassan and fge I really like your answers, I'm checking both now. — bomba6, Apr 06 '14 at 13:17

fge · Accepted Answer · 2014-04-06T13:32:37.420

2

You are better off collecting the results by "walking" the string to match with .find():

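import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Note the alternation: a whole $...$ group is tried first,
// then a plain run of characters that are not dots
private static final Pattern PATTERN
    = Pattern.compile("\\$[^.$]+(\\.[^.$]+)*\\$|[^.]+");

public List<String> matchesForInput(final String input)
{
    final Matcher m = PATTERN.matcher(input);
    final List<String> matches = new ArrayList<>();

    // each successful find() yields the next token
    while (m.find())
        matches.add(m.group());

    return matches;
}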
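
To try the approach end to end, here is a minimal self-contained sketch that wraps the pattern and method above in a class and runs them against the examples from the question. The class name `FindTokensDemo` and the `main` method are assumptions added for illustration; the pattern itself is unchanged, so it does not cover the later `<...>` requirement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only PATTERN and matchesForInput come from the answer.
class FindTokensDemo {
    // A $...$ group (which may contain dots) is tried first,
    // otherwise a run of characters that are not dots.
    private static final Pattern PATTERN =
        Pattern.compile("\\$[^.$]+(\\.[^.$]+)*\\$|[^.]+");

    static List<String> matchesForInput(final String input) {
        final Matcher m = PATTERN.matcher(input);
        final List<String> matches = new ArrayList<>();
        // Walk the input: every successful find() is one token.
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "House.Car.[0].Flower",
            "House.Car.$something$",
            "House.Car2.$4.45$.[0]",
            "House.Car2.$abc.def$.[0]"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + matchesForInput(input));
        }
    }
}
```

For `House.Car2.$4.45$.[0]` this yields the four tokens `House`, `Car2`, `$4.45$` and `[0]`, as requested in the question.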

edited Apr 06 '14 at 13:32

answered Apr 06 '14 at 12:57


Just out of curiosity, why did you make `PATTERN` a constant? Oh, and it should be `PATTERN = Pattern.compile("\\...]+")`. – The Guy with The Hat Apr 06 '14 at 13:01
@TheGuywithTheHat well, because I only need to create it once ;) After that I can create as many `Matcher`s as I want from it ;) – fge Apr 06 '14 at 13:14
Do you actually need to use `\\.[^$]+`? Notice that `\\.` is not special case, it is one of normal characters matched with `[^$]`. – Pshemo Apr 06 '14 at 13:29
@Pshemo fixed; `normal` is now `[^.$]` – fge Apr 06 '14 at 13:33
OK, now it is correct :) It is just hard for me to see advantage of this solution over simple `\\$[^$]+\\$|[^.]+`. I will probably need to read book you mentioned in your [previous answer](http://stackoverflow.com/a/17043605/1393766). – Pshemo Apr 06 '14 at 13:39
I've updated my question, they asked me at work to add this feature. I will really appreciate it if you can take a look. Hope I'm not being rude (-: – bomba6 Apr 06 '14 at 13:57
@fge, you seem to be making this harder than it needs to be. We don't care if there's a period between the dollar signs. That's the whole point: the period is significant only when it's *not* enclosed in delimiters like `$$` or `<>`. – Alan Moore Apr 06 '14 at 14:02
I think `"\\$[^.$]+(\\.[^.]+)*>|\\$[^.$]+(\\.[^.$]+)*\\$|[^.]+"` answers my new question.. this seems complicated but works for me... – bomba6 Apr 06 '14 at 14:04
Maybe you should have a look at something like parboiled? – fge Apr 06 '14 at 14:08

Jerry · Answer 2 · 2014-04-06T13:22:06.890

1

It will be easier with Pattern/Matcher I believe. Raw regex:

\$[^$]+\$|\[[^\]]+\]|[^.]+

In code:

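import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "House.Car2.$4.45$.[0]";
// $...$ groups first, then [...] groups, then plain runs of non-dot characters
Pattern pattern = Pattern.compile("\\$[^$]+\\$|\\[[^\\]]+\\]|[^.]+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group());
}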

Output:

House
Car2
$4.45$
[0]

ideone demo
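
The pattern above does not cover the follow-up requirement from the question, where a `<...>` group directly after the closing `$` may also contain dots. One possible extension, offered as a sketch and not as part of the original answer, is to allow an optional `<[^>]+>` right after the `$...$` alternative. The class name `AngleSuffixDemo` and the method name `tokenize` are hypothetical, and the sketch assumes `<...>` only ever appears immediately after a `$...$` group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class extending the answer's regex with an optional <...> suffix.
class AngleSuffixDemo {
    // 1) $...$ optionally followed by <...>, 2) [...], 3) a run of non-dot characters.
    private static final Pattern TOKEN =
        Pattern.compile("\\$[^$]+\\$(?:<[^>]+>)?|\\[[^\\]]+\\]|[^.]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        // Dots outside the delimited groups match no alternative, so find() skips them.
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("House.Car.$abc.def$<ghi.jk>.[0].bla")) {
            System.out.println(token);
        }
    }
}
```

With the input from the updated question, this prints `House`, `Car`, `$abc.def$<ghi.jk>`, `[0]` and `bla`, one per line. Because the `$...$` alternative comes first, the dots inside both the `$...$` and the `<...>` part are consumed before the generic `[^.]+` alternative gets a chance to split on them.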

edited Apr 06 '14 at 13:22

answered Apr 06 '14 at 13:07


I've updated my question, I actually need those `$`s. – bomba6 Apr 06 '14 at 13:20
@bomba6 Well, that's an easy fix. You just need to remove the `replaceAll`. I'll edit. – Jerry Apr 06 '14 at 13:21

Pshemo · Answer 3 · 2014-04-06T15:32:15.013

If not using regex is an option, you can write your own parser that iterates once over the characters of your string, checking whether each character is inside $...$, [...] or <...>.

When you find a character other than ., just add it to the token you are building, like any ordinary character.
Do the same when you find a . that is inside one of the areas mentioned above.
But if you find a . outside of these areas, you need to split on it, which means adding the token built so far to the result and clearing the buffer for the next token.

Such a parser can look like this:

import java.util.ArrayList;
import java.util.List;

public static List<String> parse(String input) {
    // list which will hold the returned tokens
    List<String> tokens = new ArrayList<>();

    // flags telling whether the character being tested is inside one of the
    // special areas (we start outside all of them, so they are set to false)
    boolean insideDollar = false;         // $...$
    boolean insideSquareBrackets = false; // [...]
    boolean insideAngleBrackets = false;  // <...>

    // we need a buffer to build tokens; StringBuilder is a good fit here
    StringBuilder sb = new StringBuilder();

    // iterate over all characters and decide whether to add each one to the
    // current token or to add the finished token to the result list
    for (char ch : input.toCharArray()) {

        // first update which area we are in:
        // a $ either starts or ends a `$...$` area, so simply negating
        // the flag is enough to update its status
        if (ch == '$') insideDollar = !insideDollar;
        // the remaining flags are set on the opening character
        // and cleared on the closing one
        else if (ch == '[') insideSquareBrackets = true;
        else if (ch == ']') insideSquareBrackets = false;
        else if (ch == '<') insideAngleBrackets = true;
        else if (ch == '>') insideAngleBrackets = false;

        // now that we know which area we are in, handle the character:
        // if it is not a dot, OR it is a dot inside one of the areas,
        // just add it to the token (append it to the StringBuilder)
        if (ch != '.' || insideAngleBrackets || insideDollar || insideSquareBrackets) {
            sb.append(ch);
        } else {
            // otherwise it is a dot outside of the special areas, so it is a
            // separator: do not add it to the token, but add the buffer
            // (the current token) to the results and reset the buffer
            // for the next token
            tokens.add(sb.toString());
            sb.setLength(0);
        }
    }
    // the buffer is only flushed when a separating dot is found, so the last
    // token (which has no dot after it) is still in the buffer and has to be
    // added manually after the loop
    if (sb.length() > 0) // only a non-empty token is added
        tokens.add(sb.toString());

    return tokens;
}

You can use it like this:

String  input = "House.Car2.$abc.def$<ghi.jk>.[0]";
for (String s: parse(input))
    System.out.println(s);

Output:

House
Car2
$abc.def$<ghi.jk>
[0]
