Parsing a string for a start and end in Java

Question

I was having trouble finding any documentation at all on a type of parsing I need to do for a Java string.

It's not something simple like splitting by lines or commas; it's a bit more complicated.

My program grabs a web page's source, and I need to parse it for the content of a few tags.

Something like parsing it for what's between

<input name="sid" type="hidden" value="

and

" />

So, if the web page had this string:

<input name="sid" type="hidden" value="stringvaluehere" />

It would output

stringvaluehere

Can anyone help? I've found no sort of documentation on anything like this at all, and asking around at other sources has been no help.

Thanks!

Why? Scraping the web is almost always more trouble than it's worth. — Mike G, Dec 12 '12 at 00:55
You could try to use the java xml parser. Take a look at saxparser in javax.xml.parsers.SAXParser — Erik, Dec 12 '12 at 01:00

score 5 · Answer 1 · edited May 23 '17 at 11:52


If you want to parse HTML, I would suggest using an HTML parser rather than using String operations. Parsing the document as a String is just asking for problems when you run into strange input that you weren't expecting.

This question has some discussion of good potential Java HTML Parsers: Java HTML Parsing

answered Dec 12 '12 at 00:57 · Jon7 (edited by Community)

Thanks! I'm most likely going to use the JSoup library, but this link was still extremely helpful. – N01zii Dec 12 '12 at 01:14

score 5 · Accepted Answer · answered Dec 12 '12 at 01:00


You could use a library for this, such as JSoup. It's often much easier than trying to parse the DOM manually.

Document doc = Jsoup.connect("http://www.example.com").get();
Elements inputs = doc.select("input#sid");
for(Element input : inputs) {
    System.out.println(input.attr("value"));
}

Simple to use and, importantly, easy to read.

answered Dec 12 '12 at 01:00 · anotherdave

Oh wow, that does seem extremely simple. Thanks a ton for the help, I'll most likely be using that library for things of this sort from now on! – N01zii Dec 12 '12 at 01:14
Oh, one more question: The "#sid" part of grabbing the input value didn't seem to work. When I just leave it as plain input it dumps all of the input values on the page. Do you know if there's any way I can narrow it down to just one value, by name or something? I tried every way possible that I could think of, but I couldn't find it online either. – N01zii Dec 12 '12 at 01:59
Sorry, just saw your comment! I had assumed above that you were using an ID of `sid`, but you're actually using a name attribute. With an ID you could use the selector with the hash (pound) sign, but to select by name you should use `input[name=sid]`. Note that for good accessibility, `input` elements should all have an ID too (e.g. `<label for="foo">Test</label> <input id="foo" name="bar" />`: `bar` will be the named param passed, but the `foo` ID will associate it with its label). See http://jsoup.org/apidocs/org/jsoup/select/Selector.html for more CSS selectors. – anotherdave Dec 17 '12 at 14:23

MadProgrammer · Answer 3 · 2012-12-12T01:38:56.530

This is a little heavy-handed, and there is probably a really cool and wacky regexp that will do this better, but...

String value = "<input name=\"sid\" type=\"hidden\" value=\"stringvaluehere\" />";
value = value.substring(value.indexOf("value=\"") + "value=\"".length());
value = value.substring(0, value.indexOf("\""));
System.out.println(value);

Prints stringvaluehere
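The two `substring` calls above can be wrapped in a helper that takes the start and end markers from the question and copes with a missing marker. This is only a sketch: the class and method names (`MarkerExtract`, `between`, `betweenRegex`) are made up for illustration, and returning `null` on a missing marker is an assumption about what the caller wants. The second method is one possible version of the regexp mentioned above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class MarkerExtract {

    // Returns the text between the first occurrence of start and the next
    // occurrence of end after it, or null if either marker is missing.
    static String between(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = text.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return text.substring(from, to);
    }

    // Regex variant: Pattern.quote treats the markers as literal text and
    // the reluctant group (.*?) stops at the first end marker.
    static String betweenRegex(String text, String start, String end) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end),
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<input name=\"sid\" type=\"hidden\" value=\"stringvaluehere\" />";
        String start = "<input name=\"sid\" type=\"hidden\" value=\"";
        String end = "\" />";
        System.out.println(between(html, start, end));      // prints stringvaluehere
        System.out.println(betweenRegex(html, start, end)); // prints stringvaluehere
    }
}
```

Both versions depend on the page writing the tag exactly as the start marker does (same attribute order, same spacing), which is why the parser-based approaches are usually safer.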

Update

Another approach would have you treat the HTML text as XML and use the XML parser to find the attributes of the element. While it sounds complicated, it is by FAR an easier solution, especially if you tend to parse multiple web pages.
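As a rough illustration of the XML route, here is a sketch using the JDK's built-in DOM parser. It assumes the markup is well-formed XML, which many real pages are not, and the names `XmlAttributeLookup` and `inputValue` are hypothetical. Malformed input is reported as an `IllegalArgumentException`.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration only.
class XmlAttributeLookup {

    // Returns the value attribute of the first <input> element whose name
    // attribute equals inputName, or null if no such element exists.
    static String inputValue(String xml, String inputName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList inputs = doc.getElementsByTagName("input");
            for (int i = 0; i < inputs.getLength(); i++) {
                Element input = (Element) inputs.item(i);
                if (inputName.equals(input.getAttribute("name"))) {
                    return input.getAttribute("value");
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Thrown, for example, when the text is not well-formed XML.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<form><input name=\"sid\" type=\"hidden\" value=\"stringvaluehere\" /></form>";
        System.out.println(inputValue(xml, "sid")); // prints stringvaluehere
    }
}
```

Unlike the substring version, this does not care about attribute order or extra whitespace inside the tag.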

Two solutions that might help would be jsoup and Cobra.

Thanks for that code snippet, and yeah I had been thinking there would be some weird regexp haha. Got this all finished now though, thanks for the JSoup recommendation! — N01zii, Dec 12 '12 at 01:36

score 1 · Answer 4 · answered Dec 12 '12 at 00:59


If the page is well-formed XML, you can use the XPath query language for this purpose. It is a far cleaner solution than low-level regexp matching. Alternatively, use an existing library for parsing HTML.
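A minimal sketch of the XPath idea with the JDK's `javax.xml.xpath` package, assuming the input really is well-formed XML. The class and method names (`XPathLookup`, `sidValue`) are invented for this example, and the expression `//input[@name='sid']/@value` is simply one way to express the lookup from the question.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class XPathLookup {

    // Returns the value attribute of the input named "sid", or an empty
    // string if the document contains no such element.
    static String sidValue(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate("//input[@name='sid']/@value",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            // Raised when the expression cannot be evaluated, for example
            // because the text is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<form><input name=\"sid\" type=\"hidden\" value=\"stringvaluehere\" /></form>";
        System.out.println(sidValue(xml)); // prints stringvaluehere
    }
}
```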

answered Dec 12 '12 at 00:59 · Jiri Kremser
