I have a Java string containing XML, and I am trying to use a Java regex to strip out all the markup between the words so that each placeholder reads as plain text enclosed in brackets, e.g. [DEFENDANT].

I want to go from this:

<w:p><w:r><w:t>[</w:t></w:r><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r>

</st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>CITY</w:t></w:r>

</st1:PlaceType><w:r><w:t>], [</w:t></w:r><st1:place w:st="on"><st1:PlaceName w:st="on"><w:r>

<w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r>

<w:t>STATE</w:t></w:r></st1:PlaceType></st1:place><w:r><w:t>] [DEFENDANT ZIP]</w:r><w:r>

to this:

<w:p><w:r><w:t>[DEFENDANT CITY], [DEFENDANT STATE] [DEFENDANT ZIP]</w:r><w:r>

I have been testing regex expressions like (\[)<.+>+([A-Z ]+\]) extensively on RegexPlanet, to no avail.

Maroun
Yair

2 Answers

Do not use regex to parse XML. Use Java's built-in XML library instead.
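As a rough illustration of that advice, here is a minimal sketch using the JDK's DOM parser. It assumes the fragment is well-formed XML (the snippet in the question is truncated and is not, so the sample input below is a closed-off version of its first placeholder). The class name `RunTextExtractor` and the method `textOf` are made up for this example. It only recovers the text; writing it back into a single `<w:t>` run is left out.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
class RunTextExtractor {

    // Concatenates the text of every <w:t> element, in document order.
    static String textOf(String xml) {
        try {
            // The default factory is not namespace-aware, so "w:t" is matched
            // as a plain element name and the prefixes need no declarations.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList texts = doc.getElementsByTagName("w:t");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < texts.getLength(); i++) {
                out.append(texts.item(i).getTextContent());
            }
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Assumption: callers treat malformed input as a bad argument.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Well-formed stand-in for the first placeholder in the question.
        String xml = "<w:p><w:r><w:t>[</w:t></w:r>"
                + "<st1:PlaceName w:st=\"on\"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName>"
                + "<w:r><w:t> </w:t></w:r>"
                + "<st1:PlaceType w:st=\"on\"><w:r><w:t>CITY</w:t></w:r></st1:PlaceType>"
                + "<w:r><w:t>]</w:t></w:r></w:p>";
        System.out.println(textOf(xml)); // prints [DEFENDANT CITY]
    }
}
```

Because only the contents of `<w:t>` elements are collected, line breaks between tags do not end up in the result.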

Yishai

If it's all on a single line, like this:

<w:p><w:r><w:t>[</w:t></w:r><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>CITY</w:t></w:r></st1:PlaceType><w:r><w:t>], [</w:t></w:r><st1:place w:st="on"><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>STATE</w:t></w:r></st1:PlaceType></st1:place><w:r><w:t>] [DEFENDANT ZIP]</w:r><w:r>

Then this regex should work:

([<\w:\w>]+)(\[[</\w:\w>]+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\],\s\[</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+></\w+:\w+><\w:\w><\w:\w>\]\s\[)(\w+\s\w+)(\])(</\w:\w><\w:\w>)

I have a working example here: RegExr

I could have grouped things a little better, but overall, it gets the job done, so you should be able to see it working.

If it's not on a single line (as in your example), then this should work instead:

([<\w:\w>]+)(\[[</\w:\w>]+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w>\s+</\w+:\w+><\w:\w><\w:\w>\s</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w>\s+</\w+:\w+><\w:\w><\w:\w>\],\s\[</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w+:\w+\s\w:\w+="\w+"><\w:\w>\s+<\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w>\s+<\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+></\w+:\w+><\w:\w><\w:\w>\]\s\[)(\w+\s\w+)(\])(</\w:\w><\w:\w>)

You can see that on RegExr here.
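For applying this kind of cleanup from Java, here is a shorter sketch that takes a different route from the patterns above: it finds each bracketed span, then strips the tags and line breaks inside it. It assumes that `[` and `]` only ever appear as placeholder delimiters and that no attribute value contains `>`; both hold for the sample in the question, but that is an assumption about your data. `BracketCleaner` and `clean` are names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class BracketCleaner {

    // Everything from "[" up to the next "]", markup included.
    private static final Pattern BRACKETED = Pattern.compile("\\[[^\\]]*\\]");

    // A single tag, or a run of line breaks between tags.
    private static final Pattern TAG_OR_NEWLINE = Pattern.compile("<[^>]+>|[\\r\\n]+");

    static String clean(String input) {
        Matcher m = BRACKETED.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Remove markup inside this bracketed span only.
            String stripped = TAG_OR_NEWLINE.matcher(m.group()).replaceAll("");
            // quoteReplacement guards against "$" or "\" in the text.
            m.appendReplacement(out, Matcher.quoteReplacement(stripped));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "<w:p><w:r><w:t>[</w:t></w:r>"
                + "<st1:PlaceName w:st=\"on\"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName>"
                + "<w:r><w:t> </w:t></w:r>"
                + "<st1:PlaceType w:st=\"on\"><w:r><w:t>CITY</w:t></w:r></st1:PlaceType>"
                + "<w:r><w:t>] [DEFENDANT ZIP]</w:r><w:r>";
        // prints <w:p><w:r><w:t>[DEFENDANT CITY] [DEFENDANT ZIP]</w:r><w:r>
        System.out.println(clean(input));
    }
}
```

Markup outside the brackets is left untouched, and since line breaks inside a span are dropped too, the same code covers both the single-line and the multi-line input.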