Quick way to find a value in HTML (Java)

Question

Using regular expressions, what is the simplest way to fetch a website's HTML and find the value inside this tag (or any attribute's value, for that matter)?

<html>
  <head>
  [snip]
  <meta name="generator" value="thevalue i'm looking for" />
  [snip]

Mike Haboustak · Accepted Answer · 2008-08-28T01:23:33.443

It depends on how sophisticated an HTTP request you need to build (authentication, etc.). Here's one simple way I've seen used in the past.

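import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Read the page into memory one line at a time.
StringBuilder html = new StringBuilder();
URL url = new URL("http://www.google.com/");
BufferedReader input = null;
try {
    input = new BufferedReader(
        new InputStreamReader(url.openStream()));

    String htmlLine;
    while ((htmlLine = input.readLine()) != null) {
        html.append(htmlLine).append('\n');
    }
}
finally {
    // input is still null if openStream() threw.
    if (input != null) {
        input.close();
    }
}

// Capture the contents of the value attribute.
Pattern exp = Pattern.compile(
    "<meta name=\"generator\" value=\"([^\"]*)\" />");
Matcher matcher = exp.matcher(html.toString());
if (matcher.find())
{
    System.out.println("Generator: " + matcher.group(1));
}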

Probably plenty of typos here to be found when compiled. (hope this wasn't homework)

What if the meta tag is commented out? This will still read it. Is that right? What if there are two spaces between meta and name? Or a tab? Or a newline? What if the word generator is not surrounded by quotes? Because of these issues and plenty more, I suggest not writing this yourself but finding a library that will do it for you. — Steve McLeod, Nov 22 '09 at 09:27
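
The whitespace, quoting, and case variations raised in the comment above can be partly absorbed by loosening the pattern. The following is only a sketch of that idea, not a robust parser: the class name `TolerantMetaRegex` and the method `findGenerator` are hypothetical, it assumes `name` comes directly before `value`, and it does nothing about commented-out tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TolerantMetaRegex {
    // \s+ accepts spaces, tabs, and newlines between attributes;
    // the quotes around "generator" are optional; matching ignores case.
    // Assumption: name comes directly before value.
    private static final Pattern GENERATOR = Pattern.compile(
        "<meta\\s+name\\s*=\\s*[\"']?generator[\"']?\\s+"
            + "value\\s*=\\s*([\"'])(.*?)\\1\\s*/?>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the value attribute, or null when no tag matches.
    static String findGenerator(String html) {
        Matcher m = GENERATOR.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String html =
            "<META  name=generator\n\tvalue=\"thevalue i'm looking for\" />";
        System.out.println(findGenerator(html));
    }
}
```

Each variation handled this way makes the expression harder to read, which is arguably the point the comment is making in favour of a library.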

score 4 · Answer 2 · edited Dec 20 '18 at 12:56

It's amazing how no one, when addressing the problem of using RegEx with HTML, confronts the problem of HTML often NOT being well-formed, thus rendering a lot of HTML parsers completely useless.

If you are developing tools to analyze webpages and it's a fact that these are not well-formed HTML, the statement "Regex should never be used to parse HTML" or "use an HTML parser" is just completely bogus. The fact is that in the real world, people create HTML as they feel like, and not necessarily in a form suited for parsers.

RegEx is a completely valid way to find elements in text, and thus in HTML. If there is any other reasonable way to confront the problems the Original Poster has, then post it instead of referring to a "use a parser" or "RTFM" statement.

score 1 · Answer 3 · edited Dec 20 '18 at 12:56

You should be using an XPath query.

It's as simple as getting the value of /html/head/meta[@name='generator']/@value.

A good tutorial: Parsing an XML Document with XPath
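
A minimal sketch of evaluating that expression with the JDK's built-in `javax.xml.xpath` API follows. It assumes the input is well-formed XML (XHTML) with no namespace declared and no DOCTYPE; ordinary HTML will usually fail to parse here. The class name `GeneratorXPath` and the method `findGenerator` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class GeneratorXPath {
    // Returns the generator value, or an empty string if the node is absent.
    // Assumption: the input is well-formed XML with no namespace declared.
    static String findGenerator(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc =
                builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(
                "/html/head/meta[@name='generator']/@value", doc);
        } catch (Exception e) {
            // Parse failures land here, e.g. unclosed tags in plain HTML.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head>"
            + "<meta name=\"generator\" value=\"thevalue i'm looking for\" />"
            + "</head><body/></html>";
        System.out.println(findGenerator(xhtml));
    }
}
```

Pages that declare the XHTML namespace on the root element would need a namespace-aware setup and a prefixed expression, which this sketch leaves out.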

vrdhn · answered Sep 26 '08 at 01:09 · edited Dec 20 '18 at 12:56 by akash

How do you suggest we execute XPath against HTML, when HTML is not XML? You can't guarantee that HTML can be loaded as an XML document for XPath navigation. Now an HTML DOM is a great tool for this, but RegEx works and is straightforward. – Mike Haboustak Jan 31 '09 at 04:12

The example in the question is obviously XHTML and therefore XML, because it has a self-closing tag. – Ben James Nov 22 '09 at 09:39

score 0 · Answer 4 · answered Sep 19 '08 at 11:07

Strictly speaking you can't really be sure you got the right value, since the meta tag may be commented out, or the meta tag may be in uppercase etc. It depends on how certain you are that the HTML can be considered as "nice".
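
If the HTML is only "mostly nice", the two cases named here can be handled before matching. This is a rough sketch under that assumption: it strips HTML comments first and then applies the accepted answer's pattern case-insensitively. The class name `CommentAwareSearch` and the method `findGenerator` are hypothetical, and comment-like text inside scripts or attribute values would be removed too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentAwareSearch {
    // Matches an HTML comment, including ones that span several lines.
    private static final Pattern COMMENT =
        Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Same pattern as the accepted answer, but ignoring letter case.
    private static final Pattern GENERATOR = Pattern.compile(
        "<meta name=\"generator\" value=\"([^\"]*)\" />",
        Pattern.CASE_INSENSITIVE);

    // Returns the value from the first tag outside a comment, or null.
    static String findGenerator(String html) {
        String visible = COMMENT.matcher(html).replaceAll("");
        Matcher m = GENERATOR.matcher(visible);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<!-- <meta name=\"generator\" value=\"old\" /> -->\n"
            + "<META NAME=\"generator\" VALUE=\"current\" />";
        System.out.println(findGenerator(html)); // prints: current
    }
}
```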

Eek · answered Sep 19 '08 at 11:07

score 0 · Answer 5 · answered Nov 22 '09 at 09:23

It depends.

If you are extracting information from a site or sites that are guaranteed to be well-formed HTML, and you know that the <meta> won't be obfuscated in some way, then reading the <head> section line by line and applying a regex is a good approach.

On the other hand, if the HTML may be mangled or "tricky" then you need to use a proper HTML parser, possibly a permissive one like HTMLTidy. Beware of using a strict HTML or XML parser on stuff trawled from random websites. Lots of so-called HTML you find out there is actually malformed.

Well-formed HTML is even more of a reason to try and use a proper parser instead of regex. Regex should never be used to parse HTML, period. — Ben James, Nov 22 '09 at 09:35

score 0 · Answer 6 · answered Aug 28 '08 at 01:22

You may want to check the documentation for Apache's org.apache.commons.httpclient package and the related packages here. Sending an HTTP request from a Java application is pretty easy to do. Poking through the documentation should get you off in the right direction.

score 0 · Answer 7 · edited Dec 20 '18 at 12:57

I haven't tried this, but wouldn't the basic framework be:

Open a java.net.HttpURLConnection
Get an input stream using getInputStream
Use the regular expression in Mike's answer to parse out the bit you want
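
The three steps above could be sketched roughly as follows. The class name `GeneratorFetcher`, the helper names, the placeholder URL, and the UTF-8 decoding are all assumptions made for illustration; the pattern is the one from Mike's answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GeneratorFetcher {
    private static final Pattern GENERATOR = Pattern.compile(
        "<meta name=\"generator\" value=\"([^\"]*)\" />");

    // Step 2: drain the stream into a string. UTF-8 is an assumption;
    // a real client would honour the charset in the response headers.
    static String readAll(InputStream in) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return html.toString();
    }

    // Step 3: apply the regular expression; null means no match.
    static String findGenerator(String html) {
        Matcher m = GENERATOR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        // Step 1: open the connection. The URL is a placeholder.
        URL url = new URL("http://www.example.com/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            String html = readAll(conn.getInputStream());
            System.out.println("Generator: " + findGenerator(html));
        } finally {
            conn.disconnect();
        }
    }
}
```

Keeping the stream reading and the matching in separate methods means both can be exercised without a network connection, for example by passing a `ByteArrayInputStream` to `readAll`.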

Paul Tomblin · answered Aug 28 '08 at 01:26 · edited Dec 20 '18 at 12:57 by akash
