
I'm parsing a text file which has data in it.

Whenever a field is text data, the data is inside quotes. Ex: " any text here "

The problem is that inside the data I can have quotes too, but they will ALWAYS be followed by another quote. Ex: " text, he said ""hello"" "

I've tried the following, with no success:

  "(.+?)"(?!") 

How can I define a REGEX that matches text data in that format?

P.S.: Don't know if it helps or not, but each type of data is separated by ;

Vinicius Seufitele
  • If I change to "(.+)"(?!") the example is perfectly parsed, but since it's greedy it expands until the next text data (if it exists). I need it to stop as soon as it finds a quote not followed by another quote. – Vinicius Seufitele Apr 27 '12 at 15:53
  • Could you please be more specific. What would you expect the outcome to be for the above example? – endy Apr 27 '12 at 16:11
  • I'm using Scala parser combinators, so I won't work on the String itself, I only need to know if it's well-formed (the string.matches should return true). – Vinicius Seufitele Apr 27 '12 at 17:08
  • That's a pretty important detail to leave out! – Old Pro Apr 28 '12 at 04:48

5 Answers

1

Try this regex (not tested):

"([^"]|"")*"

EDIT: (didn't realize you didn't want to match the quotes themselves)

(?<=")([^"]|"")*(?=")
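Since the asker only needs to know whether a field is well formed, here is a minimal Java sketch of the first pattern used with a whole-string match. The class and method names are made up for illustration. The lookaround variant is left out because a whole-string match has to cover the quotes anyway.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class QuotedFieldCheck {
    // Opening quote, then any mix of non-quote characters and doubled quotes, then closing quote.
    private static final Pattern FIELD = Pattern.compile("\"([^\"]|\"\")*\"");

    // True only if the entire input is one well-formed quoted field.
    static boolean isWellFormed(String field) {
        return FIELD.matcher(field).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("\" any text here \""));
        System.out.println(isWellFormed("\" text, he said \"\"hello\"\" \""));
        System.out.println(isWellFormed("\" a lone \" quote \""));
    }
}
```

The first two calls should report a match; the third should not, because the lone quote in the middle closes the field early and leaves unmatched text behind.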
JoelFan
1

Referring to a previous post I made here you should be able to use something like:

(?:\"[^\"]*?\")*
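A small sketch of how this pattern behaves on a whole-string match, assuming Java (the class and method names are hypothetical). It reads a doubled quote as the end of one quoted segment immediately followed by the start of the next, so the question's example is accepted. Because the outer group may repeat zero times, the empty string is accepted too.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class SegmentPatternDemo {
    // The pattern from this answer, written as a Java string literal.
    private static final Pattern SEGMENTS = Pattern.compile("(?:\"[^\"]*?\")*");

    // True if the entire input is a run of back-to-back quoted segments.
    static boolean matchesWhole(String input) {
        return SEGMENTS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("\" text, he said \"\"hello\"\" \""));
        System.out.println(matchesWhole(""));
        System.out.println(matchesWhole("\"unterminated"));
    }
}
```

If empty input should be rejected, changing the final `*` to `+` is one option.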
OldCurmudgeon
0

This splits only on doubled quotes (""), but it also gives you the data outside the quotes. Hope this helps.

import java.util.regex.Pattern;

// Class name is arbitrary; any name works.
public class SplitDemo {
    public static void main(String[] args) {
        // Split wherever two double quotes occur back to back.
        Pattern p = Pattern.compile("[\"]{2}");

        String[] result1 =
                 p.split("\"\"A01 A02\"\" \"\"B01 B02\"\"");
        for (int i = 0; i < result1.length; i++)
            System.out.printf("DATA: ]]%s[[\n", result1[i]);
        // A lone quote (between A21 and A22) is not a split point.
        String[] result3 =
                 p.split("\"\"A21 \" A22\"\" STUFF \"\"B21 B22\"\"");
        for (int i = 0; i < result3.length; i++)
            System.out.printf("DATA: ]]%s[[\n", result3[i]);
    }
}
A B
0

If you can be sure there is a character that isn't part of the message, like ~, you can replace each "" with ~, do your matching, and at the end convert ~ back to "".

text.replaceAll ("\"\"", "~").
     replaceAll ("(\"[^\"]+)", "($1)").
     replaceAll ("~", "\"\"")

Theoretically.

Practically, I get the quotation marks matched at the beginning and at the end, so this text:

echo 'asdf " I say ""hello"" " foo " you say ""goodbye"" "baz' 

is translated to:

echo 'asdf (" I say ""hello"" )(" foo )(" you say ""goodbye"" )("baz' )

I can't find the error, but maybe the idea is useful.
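One likely cause of the output above: the pattern `(\"[^\"]+)` stops before the closing quote, so that quote is picked up as the start of the next match. Below is a minimal sketch that includes the closing quote in the match and wraps the whole match with `$0`. The class and method names are hypothetical, and it still assumes ~ never occurs in the input.

```java
// Hypothetical demo class; assumes '~' never appears in the text.
class PlaceholderDemo {
    static String wrapFields(String text) {
        return text.replaceAll("\"\"", "~")            // hide doubled quotes
                   .replaceAll("\"[^\"]+\"", "($0)")   // match the field including its closing quote
                   .replaceAll("~", "\"\"");           // restore doubled quotes
    }

    public static void main(String[] args) {
        System.out.println(wrapFields(
            "asdf \" I say \"\"hello\"\" \" foo \" you say \"\"goodbye\"\" \"baz"));
    }
}
```

For the sample text this gives `asdf (" I say ""hello"" ") foo (" you say ""goodbye"" ")baz`. The placeholder trick stays fragile when a doubled quote sits directly next to a delimiter, as in """abc""", because the first replacement pairs up the wrong quotes.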

user unknown
0

If you can be sure the input is well formed (does not have unbalanced quotes), then this works (and if it's not well formed, then what do you want to do?):

"(([^"]*?)((""[^"]*?)*?))"(?!")

It is a quote, followed by anything but a quote zero or more times, followed by any number of groups consisting of a pair of double quotes followed by any number of non-quotes, and ending with a quote not followed by a quote.

If you're sure that each data item ends with a "; then it gets a little easier:

"(([^"]*?)((""[^"]*?)*?))";

but does the last one on the line end with a "; or just a quote?

With inspiration from JoelFan and OldCurmudgeon, this works and is a bit simpler:

"((?:[^"]|"")*)"

With each pattern, the data is in capturing group 1. So your code would be something like:

while (matcher.find()) {
    data = matcher.group(1);
    /* do whatever you want with the data such as replace '""' with '"' */
}
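To make the loop above concrete, here is a self-contained sketch using the simpler pattern. The class name, method name, and sample line are assumptions for illustration; the pattern is written as a Java string literal with the quotes escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the find loop above.
class QuotedFieldExtractor {
    // Group 1 captures everything between the outer quotes.
    private static final Pattern FIELD = Pattern.compile("\"((?:[^\"]|\"\")*)\"");

    static List<String> extract(String line) {
        List<String> fields = new ArrayList<>();
        Matcher matcher = FIELD.matcher(line);
        while (matcher.find()) {
            // Collapse each doubled quote in the raw content to a single quote.
            fields.add(matcher.group(1).replace("\"\"", "\""));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Two fields separated by ';', the second one starting and ending with a doubled quote.
        System.out.println(extract(
            "\" text, he said \"\"hello\"\" \";\"\"\"INSERT YOUR NAME HERE\"\"\""));
    }
}
```

The separator is never consumed by the pattern, so whatever parses the `;` afterwards still sees it.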

Of course, you have to escape the quotes in the patterns when writing them as Java Strings, so they end up looking like this in your code:

"\"(([^\"]*?)((\"\"[^\"]*?)*?))\"(?!\")"

or

"\"(([^\"]*?)((\"\"[^\"]*?)*?))\";"

or (what I would use in my code)

"\"((?:[^\"]|\"\")*)\""
Old Pro
  • I really liked your suggestion, but it doesn't work on all cases: This string, for instance, is not matched: """INSERT YOUR NAME HERE""" – Vinicius Seufitele Apr 27 '12 at 17:08
  • Just continuing, the data CAN end with a ; although if it's the last item it won't. Also, the regex can't consume that char, cause it's going to be parsed by the following regex. – Vinicius Seufitele Apr 27 '12 at 17:11
  • `"((?:[^"]|"")*)"` is a slight variation of the standard solution to this sort of problem. For pretty much any language that has string literals with escape sequences, the regex to match them will be something along the lines of `OPEN_QUOTE ( CHAR_THAT_NEEDS_NO_ESCAPE | ESCAPE_SEQUENCE ) * CLOSE_QUOTE` – Laurence Gonsalves Apr 27 '12 at 17:11
  • @Vinicius, my preferred choice tested fine on """INSERT YOUR NAME HERE""" when I tested it at [RegexPlanet](http://www.regexplanet.com/advanced/java/index.html). The first one did too. The second one obviously not, since there's no semicolon. Please check your test! – Old Pro Apr 28 '12 at 04:55