2

I'm trying to create a regex to tokenize a string. An example string would be:

"hello world" Alexandros Alex "I Am" Something

I need to get this response back:

hello world
Alexandros
Alex
I Am
Something

To be clear: tokenize on spaces, but keep the words inside quotes together as one token. If this is an easy regular expression, sorry in advance, but I always struggle with these.

jlordo
Alexandros
  • Your expected response here seems to just be your expression without quotes. In that case, you could just do a replace using [\"] – Lorcan O'Neill Feb 06 '13 at 17:30
  • @LorcanO'Neill: No, look at _hello world_ and _I am_. – jlordo Feb 06 '13 at 17:34
  • @LorcanO'Neill OP wants 5 output strings / tokens for the input example, not just the `"expression without quotes"`. – Bernhard Barker Feb 06 '13 at 17:35
  • What happens with nested quotes? `"\"hello world\"" for example" Alexandros Alex "I Am" Something` – Richard JP Le Guen Feb 06 '13 at 17:37
  • What is your expected result for the string `@#$@#%234 jkher@#$` or the string `jhkasd "asdsad` (quote not closed)? – nhahtdh Feb 06 '13 at 18:06
  • His output does not make that obvious. Your explanation does. Would it have been that hard to show your output like this? hello world, Alexandros, Alex, I am, Something – Lorcan O'Neill Feb 06 '13 at 18:18
  • http://stackoverflow.com/questions/366202/regex-for-splitting-a-string-using-space-when-not-surrounded-by-single-or-double?rq=1 – Lorcan O'Neill Feb 06 '13 at 18:20
  • @Lorcan O'Neill, sorry, I did not realize it was not obvious, but you people have a point. – Alexandros Feb 07 '13 at 13:33
  • @nhahtdh I do not really care. I liked the answer from matts, but I forgot to mention that I need to allow the *, ? and _ characters inside words. It seems davidrac gave an easier-to-understand answer that also covers this last requirement, which I had not stated before. – Alexandros Feb 07 '13 at 13:34

3 Answers

2

You could try `\b(?:(?<=")[^"]*(?=")|\w+)\b`. This will exclude the actual quotes from the matches.

import java.util.regex.*;
public class Test {
    public static void main(String...args) {
        String line = "\"hello world\" Alexandros Alex \"I Am\" Something";
        Pattern pattern = Pattern.compile("\\b(?:(?<=\")[^\"]*(?=\")|\\w+)\\b");
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            System.out.println(matcher.group(0));
        }
    }
}

When executed, you get this output:

$ javac Test.java
$ java Test
hello world
Alexandros
Alex
I Am
Something
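
For readers who want the pattern broken down, here is a minimal sketch that spells out the same regex in `Pattern.COMMENTS` mode so that each piece can carry an annotation. The class name `AnnotatedRegex` and the helper `tokenize` are made up for illustration; only the pattern itself comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class AnnotatedRegex {
    // Same pattern as in the answer. Pattern.COMMENTS ignores whitespace
    // in the pattern and treats '#' as the start of a comment.
    static final Pattern TOKEN = Pattern.compile(
          "\\b          # a token starts at a word boundary\n"
        + "(?:\n"
        + "  (?<=\")    # either: the previous character is a double quote\n"
        + "  [^\"]*     #   then take everything up to the next quote\n"
        + "  (?=\")     #   which must follow but is not consumed\n"
        + "|\n"
        + "  \\w+       # or: a plain run of word characters\n"
        + ")\n"
        + "\\b          # and the token ends at a word boundary",
        Pattern.COMMENTS);

    // Hypothetical helper: collect every match into a list.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(line);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [hello world, Alexandros, Alex, I Am, Something]
        System.out.println(tokenize("\"hello world\" Alexandros Alex \"I Am\" Something"));
    }
}
```

The two `\b` anchors are what keep the text between two quoted phrases (for example ` Alexandros Alex `, which also sits between two quote characters) from being matched as if it were quoted, because that stretch starts and ends with a space rather than a word character.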
matts
  • +1 @matts working fine but a brief description of regex would be appreciated – exexzian Feb 06 '13 at 18:25
  • Hello matts, I just noticed something I forgot to mention. Since you gave me an almost perfect answer, perhaps you can help with this extra requirement: words may contain ?, * or _, so the input could be `"hello * world?" Alexandros Alex "I Am *" Something` and I would like to get: hello * world? Alexandros Alex I Am * Something – Alexandros Feb 07 '13 at 13:24
  • @Alexandros The regex should already match special characters inside the quotes, though to match special characters in the non-quoted words, you'll have to change the `\\w` to a character class like `[\\w\\?\\*\\_]`, which will match any word character and any of the escaped special characters. – matts Feb 07 '13 at 16:09
  • How would you modify it to match single quotes instead of double? – Yuki1112 Dec 29 '19 at 15:50
  • @Yuki1112 replace the three double quotes in the regex with single quotes – matts Dec 30 '19 at 20:48
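
For the extra requirement raised in the comments above (allowing `?`, `*` and `_`), one thing to watch is that the `\b` anchors appear to reject a quoted phrase that begins or ends with a non-word character, such as the `?` in `"hello * world?"`. The sketch below therefore swaps in a different technique rather than extending the original pattern: it drops `\b` and uses two capture groups, one for quoted content and one for bare words. The class name `QuotedTokens`, the helper `tokenize`, and the exact character class are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class QuotedTokens {
    // Group 1: the text between a pair of double quotes (quotes excluded).
    // Group 2: a bare word made of word characters (which include '_'), '?' or '*'.
    static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|([\\w?*]+)");

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(line);
        while (matcher.find()) {
            // Exactly one of the two groups participates in each match.
            String quoted = matcher.group(1);
            tokens.add(quoted != null ? quoted : matcher.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [hello * world?, Alexandros, Alex, I Am *, Something]
        System.out.println(tokenize("\"hello * world?\" Alexandros Alex \"I Am *\" Something"));
    }
}
```

To tokenize on single quotes instead, the same sketch should work with the two `\"` delimiters and the `\"` inside the negated class replaced by `'`.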
1

This regular expression will match either words or entire strings within quotes (the quotes themselves are part of the match): `"[^"]*"|\w*`

You can create a matcher with this regex and just iterate through all the matches. You can find some sample code here
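
A minimal sketch of that matcher loop, assuming the sample string from the question. Because `\w*` can also match the empty string, the loop skips empty matches (writing `\w+` instead would avoid them), and it strips the surrounding quotes from quoted matches. The class name `MatchAll` and the helper `tokenize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class MatchAll {
    // The pattern from the answer: a quoted string (quotes included) or a run of word characters.
    static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\w*");

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(line);
        while (matcher.find()) {
            String token = matcher.group();
            if (token.isEmpty()) {
                continue; // \w* also matches "nothing" at spaces and at the end of input
            }
            if (token.startsWith("\"")) {
                // A quoted match always has an opening and a closing quote; drop both.
                token = token.substring(1, token.length() - 1);
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [hello world, Alexandros, Alex, I Am, Something]
        System.out.println(tokenize("\"hello world\" Alexandros Alex \"I Am\" Something"));
    }
}
```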

davidrac
0

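If you want to split instead, you can do so by checking whether the `"` characters are balanced.

A space inside a quoted string is followed by an odd number of `"` characters up to the end of the input, while a space outside quotes is followed by an even number (possibly zero). The regex below matches only whitespace of the second kind: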

\s(?=(?:([^"]*"[^"]*"[^"]*)*|[^"]*)$)
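
As a rough sketch of how this could be used with `String.split`, assuming the quotes in the input are balanced: splitting leaves the quote characters on the quoted tokens, so the example removes them afterwards. The class name `QuoteAwareSplit` and the helper `tokenize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class QuoteAwareSplit {
    // The regex from the answer: one whitespace character that is followed
    // by an even number of double quotes up to the end of the input.
    static final String SPLIT_REGEX = "\\s(?=(?:([^\"]*\"[^\"]*\"[^\"]*)*|[^\"]*)$)";

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String part : line.split(SPLIT_REGEX)) {
            // split() keeps the quotes on quoted tokens, so remove them here.
            String token = part.replace("\"", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [hello world, Alexandros, Alex, I Am, Something]
        System.out.println(tokenize("\"hello world\" Alexandros Alex \"I Am\" Something"));
    }
}
```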
Anirudha
  • What he has in the question is a **sample** string. Your regex is only going to work on his sample string only. It will fail when there are more quoted strings. – nhahtdh Feb 06 '13 at 18:18
  • @nhahtdh Have you even tested it? It **works** even when there are nested `"`. Don't assume; back your claim with **facts**, not **ifs**. – Anirudha Feb 06 '13 at 18:26
  • @nhahtdh I had missed a `*`. It works perfectly now, thanks for pointing it out. – Anirudha Feb 06 '13 at 18:33