I've been looking for a simple way to find quoted strings of text within a Java source code file. First, I looked to regular expressions. Then I realized I had two problems, because as this answer stated, there isn't going to be a totally correct regex for this, similar to the problems that arise with markup languages. The main issue comes from the fact that there may be escaped quotation marks within a string.

So, what options do I have for parsing a source code file to find strings (possibly with escaped quotations) within? Is there anything that already exists for doing this? Preferably, it would be in Python.

EDIT: Here's some oversimplified example code.

private static String[] b = {
    foo("HG@\"rND"),
    foo("K1\\"),
    bar("ab\\\\\\\"")
};

Any combination of backslashes should be handled. The desired output would be the strings themselves.

Trey Keown

4 Answers

You can use something like this:

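import re

# Java string literals cannot span lines, so the file can be read as-is.
with open('input.java') as jfile:
    text = jfile.read()
# Consume each backslash together with the character it escapes, so that
# literals such as "K1\\" and "HG@\"rND" end at the correct quote.
m = re.findall(r'"(?:\\.|[^"\\\n])*"', text)
for x in m:
    print(x)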

You will also need to strip comments first, which is not very difficult. Alternatively, look at a Java parser.
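One way to deal with comments without a separate stripping pass is to let a single regex match comments and literals alike, then keep only the string literals. The following is a minimal sketch under that assumption; the names `TOKEN` and `java_strings` are made up for illustration, and it does not handle Unicode escapes such as \u0022.

```python
import re

# Each alternative consumes one kind of token. Comments and char literals are
# matched only so that any quotes inside them are skipped.
TOKEN = re.compile(r'''
    //[^\n]*                 # line comment
  | /\*.*?\*/                # block comment
  | '(?:\\.|[^'\\\n])*'      # character literal
  | "(?:\\.|[^"\\\n])*"      # string literal
''', re.VERBOSE | re.DOTALL)

def java_strings(source):
    """Return the string literals in source, quotes and escapes included."""
    return [m.group(0) for m in TOKEN.finditer(source)
            if m.group(0).startswith('"')]
```

Because the scan proceeds left to right, a quote that appears inside a comment or a character literal is consumed as part of that token and never starts a string match.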

perreal
  • Thanks for the parser link, that's great. Unfortunately I imagine that would be slower than glancing at a file and looking for strings only. There are a few hundred decompiled .java files that I'm looking through, and each one is quite large. Being fast and lightweight is key. – Trey Keown Jan 27 '14 at 19:29

Detect the escaped-quote sequence \" and replace it with some other sequence. After that, extracting the text inside the quotes is simple.

gzix
  • foo("K1\\") would fail under this condition – Trey Keown Jan 24 '14 at 06:29
  • First replace each pair of backslashes (\\) with some placeholder string. That leaves only single escape sequences. Then look for \" – gzix Jan 24 '14 at 06:33
  • Good call. It would be possible to use, for any arbitrary valid string, an invalid escape sequence of something like \(quote) and \(backslash) instead, run the regex, and replace those with the correct values. – Trey Keown Jan 24 '14 at 06:39
  • I would always replace \\ with @~ and, after finishing, replace it back with \\ – gzix Jan 24 '14 at 06:51
  • But there's no guarantee that specific string doesn't show up somewhere else. I'm dealing with a file that has all its strings encrypted via some odd XOR scheme, and I wouldn't be at all surprised if that showed up somewhere. Better to err on the side of caution with an invalid escape sequence. – Trey Keown Jan 24 '14 at 06:58
  • Then this should help. Use [^\\](?:\\\\)*\\" which first matches a character other than a backslash, then zero or an even number of backslashes, then a single backslash followed by ". – gzix Jan 25 '14 at 05:45
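The placeholder idea from the comments above could be sketched roughly as follows. The placeholders `\B` and `\Q` are hypothetical choices: they are not valid Java escape sequences, so the sketch assumes they never occur inside string literals of source that compiles. It also assumes comments and character literals contain no stray quotes.

```python
import re

# Hypothetical placeholders; neither is a valid Java escape sequence.
ESCAPED_BACKSLASH = r'\B'
ESCAPED_QUOTE = r'\Q'

def strings_via_placeholders(text):
    # str.replace scans left to right without overlapping, so a run of
    # backslashes is consumed two at a time and at most one is left over.
    masked = text.replace('\\\\', ESCAPED_BACKSLASH)
    # Any backslash still followed by a quote is now a genuine escaped quote.
    masked = masked.replace('\\"', ESCAPED_QUOTE)
    # With escaped quotes masked, every remaining quote delimits a literal.
    found = re.findall(r'"[^"\n]*"', masked)
    # Restore in reverse order: escaped quotes first, then escaped backslashes.
    return [s.replace(ESCAPED_QUOTE, '\\"').replace(ESCAPED_BACKSLASH, '\\\\')
            for s in found]
```

The restore order matters: putting the double backslashes back first could create a spurious `\Q` when the original text had `\\` followed by the letter Q.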

What about writing a simple state machine? A simple example (with only double-quoted strings) could be:

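STATE_OUTSTRING = 0           # outside any string literal
STATE_INSTRING = 1            # inside a string literal
STATE_INSTRINGBACKSLASH = 2   # just after a backslash inside a string literal

def getstrings(text):
    state = STATE_OUTSTRING
    strings = []
    curstring = None
    for c in text:
        if state == STATE_OUTSTRING:
            if c == '"':
                # Opening quote: start collecting a new string.
                state = STATE_INSTRING
                curstring = ""
        elif state == STATE_INSTRING:
            if c == '\\':
                # The backslash itself is dropped; the next character is kept.
                state = STATE_INSTRINGBACKSLASH
            elif c == '"':
                # Unescaped closing quote: the string is complete.
                state = STATE_OUTSTRING
                strings.append(curstring)
                curstring = None
            else:
                curstring += c
        else: # STATE_INSTRINGBACKSLASH
            # Escaped character (including \" and \\): keep it as-is.
            curstring += c
            state = STATE_INSTRING
    return strings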

You could add states like STATE_INCOMMENT, for example, if needed.
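As a rough illustration of adding comment states, here is one possible extension. All state names and the function name `getstrings_skip_comments` are hypothetical. Like the code above, it drops the escaping backslash and keeps the escaped character. It does not handle character literals such as '"'.

```python
# Hypothetical state names for a scanner that also skips comments.
(OUT, IN_STRING, IN_STRING_ESC, SLASH,
 LINE_COMMENT, BLOCK_COMMENT, BLOCK_STAR) = range(7)

def getstrings_skip_comments(text):
    state = OUT
    strings = []
    cur = []
    for c in text:
        if state == OUT:
            if c == '"':
                state, cur = IN_STRING, []
            elif c == '/':
                state = SLASH          # might start a comment
        elif state == SLASH:
            if c == '/':
                state = LINE_COMMENT
            elif c == '*':
                state = BLOCK_COMMENT
            elif c == '"':
                state, cur = IN_STRING, []   # e.g. a/"x"
            else:
                state = OUT            # plain division operator
        elif state == LINE_COMMENT:
            if c == '\n':
                state = OUT
        elif state == BLOCK_COMMENT:
            if c == '*':
                state = BLOCK_STAR     # might be the start of */
        elif state == BLOCK_STAR:
            if c == '/':
                state = OUT
            elif c != '*':
                state = BLOCK_COMMENT  # a further '*' stays in BLOCK_STAR
        elif state == IN_STRING:
            if c == '\\':
                state = IN_STRING_ESC
            elif c == '"':
                strings.append(''.join(cur))
                state = OUT
            else:
                cur.append(c)
        else:  # IN_STRING_ESC: keep the escaped character
            cur.append(c)
            state = IN_STRING
    return strings
```

Collecting characters in a list and joining once per string avoids repeated string concatenation, which may matter for the large decompiled files mentioned in the question.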

Pierre

Since this is a simple one, you're probably looking for something more advanced than

("(?:\\"|.)*")

Explanation: the \\" alternative consumes any escaped quote; otherwise the pattern matches any number of characters between two quotes.

Haven't tried the other answers, so there may already be a correct answer here, but anyway...

Regards

Edit: Fix for "flaw"??? Simply "eating" all escaped backslashes seems to do the trick:

("(?:\\"|\\\\|.)*?")

Edit again ;) :

Even better I think - "eat" all escaped characters:

("(?:\\.|.)*?")
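For reference, here is a small sketch of applying that last pattern in Python to the sample from the question. The names `STRING_LITERAL` and `find_literals` are made up for this example, and the outer capturing group is omitted because `findall` then returns whole matches.

```python
import re

# The final pattern from this answer, without the outer capturing group.
# Each backslash is consumed together with the character that follows it.
STRING_LITERAL = re.compile(r'"(?:\\.|.)*?"')

def find_literals(source):
    """Return every double-quoted literal in source, quotes included."""
    return STRING_LITERAL.findall(source)

sample = r'''private static String[] b = {
    foo("HG@\"rND"),
    foo("K1\\"),
    bar("ab\\\\\\\"")
};'''

for literal in find_literals(sample):
    print(literal)
```

On malformed input, the bare `.` alternative can backtrack and match a lone backslash. Replacing it with `[^"\\]` is one way to rule that out.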
SamWhan
  • There's a flaw in it... It won't handle escaped backslashes correctly. I.e. `foo(bar("K1\\"),"");` won't be parsed correctly. I'll get back if I find a solution. – SamWhan Jan 24 '14 at 10:27