Java get specific part of HTML

Question

I'm looping through a load of HTML and I'm trying to just extract the parts I need.

I need to just get 'THISISTHEBITIWANT' from the html below.

<li class="aClass">
  <a href="example/THISISTHEBITIWANT">example</a>
</li>

<li class="aClass">
  <a href="example/THISISTHEBITIWANT">example2</a>
</li>

Each time I only want to get the 'THISISTHEBITIWANT' and the text in the link will change. I've looked at string replace - but as I don't know what 'example' or 'example2' is going to be each time, I can only remove up until 'example/' at the moment.

This was my Java code:

html = inputLine.replace("<li class=\"aClass\"><a href=\"/example/", "");

If anyone could offer any advice, it would be much appreciated!

Take a look at this comparison of Java HTML parsers: http://stackoverflow.com/questions/3152138/what-are-the-pros-and-cons-of-the-leading-java-html-parsers - Dror Bereznitsky, Mar 31 '14 at 14:59

score 0 · Answer 1 · answered Mar 31 '14 at 19:17

The standard way to process HTML is to use an HTML parsing library, as the two comments suggest. If you really only need to pull out this one piece, though, a regular expression may suffice.

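import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regular {
    public static void main(String[] args) {
        String original = "<li class=\"aClass\">\n<a href=\"example/THISISTHEBITIWANT\">example2</a>\n</li>";
        // Group 1 captures everything after "example/" up to the closing quote of the href.
        Pattern mypattern = Pattern.compile("<li class=\"aClass\">\\s+<a href=\"example/([^\"]+)\"");
        Matcher matcher = mypattern.matcher(original);
        // find() advances to each successive match, so several <li> elements are handled too.
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}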
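
One caveat with the pattern above: \s+ requires at least one whitespace character between the <li> and <a> tags, while the replace() call in the question suggests the input may have the tags adjacent on one line and the href starting with a slash. The following is a minimal sketch of a more tolerant variant, assuming both forms can occur. The class name HrefExtractor and the method extractSegments are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefExtractor {
    // \s* allows zero or more whitespace characters between the tags (assumption:
    // the real input may or may not contain line breaks there).
    // /? allows an optional leading slash before "example/" (assumption, based on
    // the replace() call in the question).
    private static final Pattern LINK = Pattern.compile(
            "<li class=\"aClass\">\\s*<a href=\"/?example/([^\"]+)\"");

    // Returns the part of each matching href that follows "example/".
    static List<String> extractSegments(String html) {
        List<String> segments = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            segments.add(matcher.group(1));
        }
        return segments;
    }

    public static void main(String[] args) {
        String html = "<li class=\"aClass\"><a href=\"/example/FIRST\">example</a></li>\n"
                + "<li class=\"aClass\">\n  <a href=\"example/SECOND\">example2</a>\n</li>";
        System.out.println(extractSegments(html)); // prints [FIRST, SECOND]
    }
}
```

Collecting the matches into a list rather than printing them makes the result easier to reuse in the surrounding loop. If the markup varies more than this (extra attributes, different quoting), an HTML parser is likely the safer choice.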
