
In the following scenario we have a String containing the raw HTML of a page (it can be as large as you want), and we have to find some values in it. The HTML does not have any ids or classes.

From that large String of HTML we have to extract some values and save them in variables. In this example, the value to extract is the total number of credits (60).

String response = "...
                   <BR>
                   <FONT COLOR="NAVY" FACE="ARIAL" SIZE="2">
                    <B>TOTAL CREDITS:</B>&NBSP; 60
                   </FONT>
                   <BR>
                    ..."

What is the best way to extract that value?

What I do is identify a unique prefix, cut the String at that point, and then cut off the suffix.

String value = response.split("TOTAL CREDITS:</B>&NBSP;")[1].split("</FONT>")[0].trim();

Is there a better way to do that?

Oscar Méndez
  • Don't use regex to parse / extract values from HTML – TheLostMind Feb 23 '18 at 08:42
  • Use a DOM (not short for dominatrix) parser. – Tim Biegeleisen Feb 23 '18 at 08:43
  • Mandatory link: https://stackoverflow.com/a/1732454/1393766. Also see [Using regular expressions to parse HTML: why not?](https://stackoverflow.com/q/590747) and [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](https://stackoverflow.com/q/701166) – Pshemo Feb 23 '18 at 08:44

2 Answers


There are dedicated APIs for parsing HTML in Java, such as jsoup.

This link can be a good starting point https://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/

If you are using Maven, you have to include this dependency:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>

Then you can use this code as a starting point. With jsoup you load the page into a Document (a DOM tree) and then search for elements in it, much as you would when parsing XML files:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class JsoupExample {

    public static void main(String[] args) {
        try {
            // the URL must include the protocol (http or https)
            Document doc = Jsoup.connect("http://google.com").get();

            // get page title
            String title = doc.title();
            System.out.println("title : " + title);

            // get all links that have an href attribute
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get the value of the href attribute and the link text
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            // thrown when the page cannot be fetched
            e.printStackTrace();
        }
    }
}

Hope this helps

Francisco Valle
  • While links are nice as an *additional* resource, they can't be the *only* resource. What would such an answer be worth if the linked resource changes or breaks? It is better to put the essential parts in the answer itself and leave the link for anyone who wants to learn more. Take a look at [Are answers that just contain links elsewhere really “good answers”?](https://meta.stackexchange.com/q/8231) and [Your answer is in another castle: when is an answer not an answer?](https://meta.stackexchange.com/q/225370) – Pshemo Feb 23 '18 at 08:49

To reiterate what is in the comments: don't parse HTML with regex.

However, to answer your direct question of whether there is a better way to do it for some general string: yes, just use String.indexOf.

One problem with what you're doing right now is that you create lots of extra strings and arrays that you immediately discard, so you may as well not create them. The other problem is that String.split takes a regular expression as its parameter, so you need to take care that the prefix and suffix don't contain regex special characters (unless you actually want them to act as regex). You can avoid this problem by quoting them with Pattern.quote.
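As a rough illustration of the regex pitfall, here is a minimal sketch. The class name QuoteSplitDemo, the helper afterLiteral, and the "TOTAL (EUR):" marker are hypothetical and not part of the question; the parentheses are only there to show what happens when a marker contains regex metacharacters.

```java
import java.util.regex.Pattern;

final class QuoteSplitDemo {

    // Hypothetical helper: returns the text after the first literal occurrence
    // of marker. Pattern.quote makes split treat the marker as plain text.
    static String afterLiteral(String text, String marker) {
        String[] parts = text.split(Pattern.quote(marker), 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("marker not found: " + marker);
        }
        return parts[1];
    }

    public static void main(String[] args) {
        String text = "<B>TOTAL (EUR):</B> 60";

        // Unquoted, the parentheses form a regex group, so the pattern looks
        // for "TOTAL EUR:" and never matches: split returns one element.
        System.out.println(text.split("TOTAL (EUR):</B>", 2).length); // prints 1

        // Quoted, the marker is matched literally.
        System.out.println(afterLiteral(text, "TOTAL (EUR):</B>").trim()); // prints 60
    }
}
```

This still allocates an intermediate array, so it only addresses the special-character problem, not the extra allocations.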

This:

String value = response.split("TOTAL CREDITS:</B>&NBSP;")[1].split("</FONT>")[0].trim();

is taking the portion of the string after the prefix, and before the suffix.

You can find where the prefix ends like this:

int endOfPrefix = response.indexOf(prefix) + prefix.length();

(you'd need to consider the case where prefix isn't in the string)

and the start of the suffix like this:

int startOfSuffix = response.indexOf(suffix, endOfPrefix);

(you'd need to consider the case where suffix isn't in the string). The endOfPrefix parameter may not be necessary; this just ensures that you don't find an occurrence of the suffix before the occurrence of the prefix.

Then just take the substring between them:

String value = response.substring(endOfPrefix, startOfSuffix);
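
Putting the steps above together, including the two "not in the string" cases, might look like the sketch below. The class name SubstringBetween, the method between, and the choice to return null when a marker is missing are assumptions made for illustration; throwing an exception or returning an Optional would work just as well.

```java
final class SubstringBetween {

    // Returns the trimmed text between the first occurrence of prefix and the
    // next occurrence of suffix after it, or null if either marker is missing.
    static String between(String text, String prefix, String suffix) {
        int prefixStart = text.indexOf(prefix);
        if (prefixStart < 0) {
            return null; // prefix isn't in the string
        }
        int endOfPrefix = prefixStart + prefix.length();

        // Searching from endOfPrefix skips any suffix that appears earlier.
        int startOfSuffix = text.indexOf(suffix, endOfPrefix);
        if (startOfSuffix < 0) {
            return null; // suffix isn't in the string after the prefix
        }
        return text.substring(endOfPrefix, startOfSuffix).trim();
    }

    public static void main(String[] args) {
        String response = "<BR><FONT COLOR=\"NAVY\" FACE=\"ARIAL\" SIZE=\"2\">"
                + "<B>TOTAL CREDITS:</B>&NBSP; 60</FONT><BR>";
        System.out.println(between(response, "TOTAL CREDITS:</B>&NBSP;", "</FONT>")); // prints 60
    }
}
```

Unlike the split version, this creates only the one substring that is actually returned (plus its trimmed copy).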
Andy Turner
  • I understand that... I chose that suffix because, when I read the string, I noticed that there is only one match for the prefix, while the suffix, in this case (</FONT>), could be in many places. So cutting at the prefix first always gives me the value when I then cut at the suffix, because it will always be the first match. – Oscar Méndez Feb 23 '18 at 08:56