3

Using Java and regex, I want to extract strings from a line of text. The text can be in any of the following formats:

  1. key1(value1) key2(value2)
  2. key1(value1) key2
  3. key1 key2(value2)
  4. key1 key2
  5. key1

I can successfully extract the keys and values for Type #1: I split the text on spaces and then use the following pattern to pull out the text in parentheses

Pattern p = Pattern.compile("\\((.*?)\\)",Pattern.DOTALL);

For Case #2 and Case #3 I could write logic that counts the occurrences of "(" and matches them against the occurrences of spaces, but the code becomes far too long. Further complications arise when spaces are present in the values too, because splitting the text on spaces then becomes problematic.
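
A minimal sketch of the split-then-match approach described above, to show where it breaks down. The class and method names (`SplitApproach`, `values`) are made up for illustration; only the pattern comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitApproach {
    // Pattern from the question: lazily captures whatever sits between "(" and ")".
    private static final Pattern VALUE = Pattern.compile("\\((.*?)\\)", Pattern.DOTALL);

    // Hypothetical helper: split on single spaces, then extract the
    // parenthesized value of each token ("" when the token has none).
    static List<String> values(String line) {
        List<String> out = new ArrayList<>();
        for (String token : line.split(" ")) {
            Matcher m = VALUE.matcher(token);
            out.add(m.find() ? m.group(1) : "");
        }
        return out;
    }

    public static void main(String[] args) {
        // Type #1 input: each token holds one complete key(value) pair.
        System.out.println(values("key1(value1) key2(value2)"));
        // A space inside the value cuts the pair into "key1(some" and "value)",
        // so neither token contains a complete "(...)" any more.
        System.out.println(values("key1(some value) key2"));
    }
}
```

The second call yields three tokens and no value at all, which is the splitting problem mentioned above.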

Is there a better regex for splitting/holding that I can use for the cases listed above?

Prasoon

2 Answers

4

Consider the following PowerShell example of a universal regex.

(?<=^|[\s)\n])([^(\n\s]*)([(]([^)\n]*)[)])?

Example

    $Matches = @()
    $String = 'key1(value1) key2(value2)
key3(value3) key3.5
key4 key5(value5)  GoofyStuff(I like kittens)
key6 key7 ForReal-Things(be sure to vote)
key8'
    Write-Host start with 
    write-host $String
    Write-Host
    Write-Host found
    ([regex]'(?<=^|[\s)\n])([^(\n\s]*)([(]([^)\n]*)[)])?').matches($String) | foreach {
        # Group 1 is the key; skip matches where it is empty
        if ($_.Groups[1].Value) {
            write-host "key at $($_.Groups[1].Index) = '$($_.Groups[1].Value)'"
            # Group 3 is the text inside the parentheses, if any
            if ($_.Groups[3].Value) {
                write-host "value at $($_.Groups[3].Index) = '$($_.Groups[3].Value)'"
            } # end if
        } # end if
    } # next match

Yields

start with
key1(value1) key2(value2)
key3(value3) key3.5
key4 key5(value5)  GoofyStuff(I like kittens)
key6 key7 ForReal-Things(be sure to vote)
key8

found
key at 0 = 'key1'
value at 5 = 'value1'
key at 13 = 'key2'
value at 18 = 'value2'
key at 27 = 'key3'
value at 32 = 'value3'
key at 40 = 'key3.5'
key at 48 = 'key4'
key at 53 = 'key5'
value at 58 = 'value5'
key at 67 = 'GoofyStuff'
value at 78 = 'I like kittens'
key at 95 = 'key6'
key at 100 = 'key7'
key at 105 = 'ForReal-Things'
value at 120 = 'be sure to vote'
key at 138 = 'key8'

Summary

  • (?<=^|[\s)\n]*) was the original lookbehind for the beginning of a key. This might not work in Java, as there is a bug/feature in how Java handles lookbehinds of unbounded length (see the comments below).
  • (?<=^|[\s)\n]) looks for the beginning of a key: each key is assumed to be at the start of the string, or right after a \n, ")", or whitespace character. This lookbehind seems to work in C# and PowerShell.

  • ([^(\n\s]*) returns all characters up to the next "(", \n, or \s

  • ([(]([^)\n]*)[)])? returns the value inside the parentheses if it exists

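    The extra logic inside the loop checks each match's groups so that only non-empty key names and values are printed; the pattern can also produce empty matches. Note that the loop reads the match objects returned by .matches(); PowerShell's automatic $Matches variable is populated by operators such as -match, not by this call.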
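
Since the question asks about Java, here is a rough port of the same idea. It is a sketch under assumptions, not tested against the asker's real data: the class and method names (`KeyValueExtractor`, `parse`) are invented, and storing a missing value as null is my choice. The pattern is the bounded-lookbehind version with the backslashes doubled for a Java string literal.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueExtractor {
    // Same pattern as in the PowerShell example above.
    // Group 1 = key, group 3 = text inside the parentheses (if present).
    private static final Pattern PAIR =
            Pattern.compile("(?<=^|[\\s)\\n])([^(\\n\\s]*)([(]([^)\\n]*)[)])?");

    // Hypothetical helper: returns keys in the order found;
    // a key without "(...)" is mapped to null.
    static Map<String, String> parse(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            String key = m.group(1);
            // The pattern can match the empty string, so skip empty keys,
            // mirroring the "if" checks in the PowerShell loop.
            if (!key.isEmpty()) {
                result.put(key, m.group(3)); // group(3) is null when there is no "(...)"
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("key1(value1) key2 key3(some value)"));
    }
}
```

Because the whole key(value) pair is consumed by one match, spaces inside a value do not need any special handling.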

Ro Yo Mi
  • Note: This solution works in C#, but it is currently relying on a bug/feature in Java's implementation of regex. – nhahtdh May 01 '13 at 18:35
  • This was written and tested in powershell, would you be able to expand on the bug/feature to which you're referring? – Ro Yo Mi May 01 '13 at 18:49
  • Check this question: http://stackoverflow.com/questions/1536915/regex-look-behind-without-obvious-maximum-length-in-java The "official" document of Java regex is in Pattern class, which doesn't really describe in detail what is considered invalid for look-behind. Therefore, it is not clear whether this is a bug or a feature. – nhahtdh May 01 '13 at 18:53
  • That's an interesting defect with Java. I'll update this answer by removing the "*" from the lookaround – Ro Yo Mi May 01 '13 at 18:59
  • Didn't analyze your regex - but yes, it doesn't need the `*`. The `*` actually makes it equivalent to empty string. If I remember correctly, look-behind always tries to match the shortest string. – nhahtdh May 01 '13 at 19:02
  • Thanks @Denomales I got the code to work in Java actually and faced no issues. – Prasoon May 01 '13 at 23:10
0

My suggestion would be:

Pattern p = Pattern.compile("(\\(?[^ \\n(]+\\)?)+", Pattern.DOTALL);

Then, iterate over the sub-matches. If the first character is a paren, you know it's the value of the previous key; otherwise, it's a key. If it's a value, just strip the parens off using substring.
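
One way the iteration described above could be sketched in Java. Note the assumption: Java keeps only the last repetition of a repeated capture group, so this sketch swaps in a simpler alternation-based token pattern of my own rather than the pattern above, and walks the tokens with `Matcher.find()`. The names `TokenWalk` and `parse` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenWalk {
    // Assumed token pattern (not the one from the answer): either a whole
    // "(...)" group, or a run of characters that are neither whitespace nor "(".
    private static final Pattern TOKEN = Pattern.compile("\\([^)]*\\)|[^\\s(]+");

    // Hypothetical helper: a token starting with "(" is the value of the
    // previous key; any other token is a new key (value null until one is seen).
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        String lastKey = null;
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String token = m.group();
            if (token.charAt(0) == '(') {
                if (lastKey != null) {
                    // Strip the surrounding parentheses with substring.
                    result.put(lastKey, token.substring(1, token.length() - 1));
                }
            } else {
                lastKey = token;
                result.put(lastKey, null);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("key1 key2(a b) key3"));
    }
}
```

Treating "(...)" as a single token keeps values with spaces intact, at the cost of not handling nested or unclosed parentheses.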

Adrian