I am trying to extract the __VIEWSTATE value from an HTML page, where it can appear in one of two forms:

First:

Blah and bleh

Some tags and stuff here that can be ignored

some more garbage and now important stuff .. id="__VIEWSTATE" value="ThisIsWhatIWantToExtract" />

garbage

Second:

Blah and bleh

Some tags and stuff here that can be ignored

some more garbage and now important stuff .. |__VIEWSTATE|ThisIsWhatIWantToExtract|garbage

garbage

I was using the .split('__VIEWSTATE') method until I realized the value can appear in two forms. This is what I have tried:

(.*\"__VIEWSTATE\" value\=\"(.*)\" \/\>.*)|(.*\|__VIEWSTATE\\|(.*)\|.*)

It seems to work for the first case but doesn't work for the second case.

What's the most efficient and correct way to do this?

  • obligatory http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 – smcg Sep 25 '13 at 14:18
  • You've got quite a confusing combination of escaped and unescaped characters in your string. It would be better to actually copy something that compiles, as is, in Java, escape values and all. – Bernhard Barker Sep 25 '13 at 14:20
  • You should compile RegExp and verify your RegExp (add assertion). – Marek R Sep 25 '13 at 14:24
  • If you're using [`find`](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#find()), the `.*` at the beginning and end are fairly pointless (and most likely make things much slower in the case of non-matches). If you're using [`matches`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#matches(java.lang.String)), you can still extract both pairs of `.*` at the start of the brackets and put them outside (i.e. `.*(\"...\>|\|...\|).*`). – Bernhard Barker Sep 25 '13 at 14:25
  • @MarekR I was testing the regex online on [link](http://java-regex-tester.appspot.com/). Maybe that's the reason I did not get compilation errors thrown at me. I should be more careful next time. –  Sep 25 '13 at 14:32
  • OK, but when you put this into a Java literal, you have to escape quote characters and backslashes. Remember that the backslash has a special meaning in a Java string literal and also a special meaning in a RegExp. – Marek R Sep 25 '13 at 14:41

2 Answers

You have mixed up your backslashes. Some are missing (they are required by both the RegExp and the Java string literal), and some are unnecessary (placed before characters that are not special in a RegExp):

".*(\"__VIEWSTATE\" +value=\"([^\"]*)\" */>|\\|__VIEWSTATE\\|([^|]*)\\|).*"

The value is then in capture group 2 (first format) or capture group 3 (second format); the other group is null.
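
As a rough illustration of how the two capture groups might be read, here is a minimal sketch. The class name `ViewStateExtractor` and the method `extract` are made up for this example and are not part of the original answer; the pattern itself is copied unchanged. The sketch uses `find()` rather than `matches()`, on the assumption that the page spans several lines, since `.` does not match line terminators by default.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
final class ViewStateExtractor {
    // Pattern from the answer, unchanged. Compiled once and reused.
    private static final Pattern VIEWSTATE = Pattern.compile(
        ".*(\"__VIEWSTATE\" +value=\"([^\"]*)\" */>|\\|__VIEWSTATE\\|([^|]*)\\|).*");

    // Returns the view state value, or null if neither format is present.
    static String extract(String page) {
        Matcher m = VIEWSTATE.matcher(page);
        if (!m.find()) {
            return null;
        }
        // Group 2 is set for the attribute form, group 3 for the pipe-delimited form.
        return m.group(2) != null ? m.group(2) : m.group(3);
    }

    public static void main(String[] args) {
        String first = "Blah and bleh\nstuff .. id=\"__VIEWSTATE\" value=\"ThisIsWhatIWantToExtract\" />\ngarbage";
        String second = "Blah and bleh\nstuff .. |__VIEWSTATE|ThisIsWhatIWantToExtract|garbage\ngarbage";
        System.out.println(extract(first));  // prints ThisIsWhatIWantToExtract
        System.out.println(extract(second)); // prints ThisIsWhatIWantToExtract
    }
}
```

Checking for null rather than assuming a match keeps the caller from silently working with the wrong string when the page contains neither form.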

Marek R

Extract the target content via replaceAll():

String stuff = str.replaceAll(".*__VIEWSTATE(\"\\s*value=\"|\\|)([^\"|]+).*", "$2");
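
Two details of `replaceAll()` are worth keeping in mind with this approach: `.` does not match line terminators unless DOTALL mode is enabled, and `replaceAll()` returns the input unchanged when nothing matches. The following is a minimal sketch that accounts for both, assuming the page is held in a single multi-line string. The class name `ViewStateReplace`, the method `extract`, and the added `(?s)` flag are assumptions made for this illustration, not part of the original answer.

```java
// Hypothetical helper class, named only for this illustration.
final class ViewStateReplace {
    // Same expression as in the answer, prefixed with (?s) so that '.' also
    // matches line breaks and the whole page is consumed by one match.
    private static final String REGEX =
        "(?s).*__VIEWSTATE(\"\\s*value=\"|\\|)([^\"|]+).*";

    // Returns the view state value, or null if the page contains neither form.
    static String extract(String page) {
        String result = page.replaceAll(REGEX, "$2");
        // replaceAll leaves the input untouched when the pattern does not match.
        return result.equals(page) ? null : result;
    }

    public static void main(String[] args) {
        String page = "Blah and bleh\nstuff .. |__VIEWSTATE|ThisIsWhatIWantToExtract|garbage\ngarbage";
        System.out.println(extract(page)); // prints ThisIsWhatIWantToExtract
    }
}
```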
Bohemian