I'm looping through a lot of HTML and trying to extract only the parts I need.

Specifically, I need to get 'THISISTHEBITIWANT' from the HTML below.

<li class="aClass">
  <a href="example/THISISTHEBITIWANT">example</a>
</li>

<li class="aClass">
  <a href="example/THISISTHEBITIWANT">example2</a>
</li>

Each time, I only want 'THISISTHEBITIWANT'; the link text will change. I've looked at String.replace, but since I don't know in advance what 'example' or 'example2' will be, I can currently only strip everything up to and including 'example/'.

This was my Java code:

html = inputLine.replace("<li class=\"aClass\"><a href=\"/example/", "");

If anyone could offer any advice, it would be much appreciated!

matt

1 Answer

The standard way to process HTML is to use an HTML parsing library, as the two comments suggest. If you really only need to pull out this one piece, though, a regular expression may suffice: match the fixed markup up to 'example/' and capture everything after it up to the closing quote of the href.

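import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regular {
    public static void main(String[] args) {
        String original = "<li class=\"aClass\">\n<a href=\"example/THISISTHEBITIWANT\">example2</a>\n</li>";
        // \s* allows any whitespace (or none) between the <li> and <a> tags.
        // Group 1 captures everything after "example/" up to the closing quote.
        Pattern mypattern = Pattern.compile("<li class=\"aClass\">\\s*<a href=\"example/([^\"]+)\"");
        Matcher matcher = mypattern.matcher(original);
        // find() advances through the input, so every matching link is printed.
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}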
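
If the surrounding markup varies (different whitespace, a different class on the li, or a leading slash in the href as in the question's replace call), a looser variant that anchors only on the href attribute may be easier to maintain. The following is a minimal sketch under those assumptions, not a replacement for a real HTML parser; the class name HrefExtractor and the method extract are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Assumption: the wanted value is the single path segment after "example/".
    // Accepts single or double quotes and an optional leading slash.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']/?example/([^\"'/]+)[\"']");

    // Returns every captured segment found in the given HTML, in order.
    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html =
            "<li class=\"aClass\">\n  <a href=\"example/FIRST\">example</a>\n</li>\n"
          + "<li class=\"aClass\">\n  <a href=\"/example/SECOND\">example2</a>\n</li>";
        System.out.println(extract(html)); // prints [FIRST, SECOND]
    }
}
```

Because the pattern ignores the li element entirely, it will also match 'example/' links elsewhere on the page; keep the stricter pattern above if only links inside li class="aClass" should count.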
merlin2011