Java - HTML code: extract part of the tag

Question

I need to extract some integers from a tag in HTML code. For example, if I have:

<tag blabla="title"><a href="/test/tt123"> TEST 1 </a></tag>

I did that by removing every other character and keeping only the digits. It worked until the title text also contained a digit, so I got "1231".

str.replaceAll("[^\\d.]", "");

How can I extract only the integer "123"? Thanks for your help!

do you want to extract numbers only from `href` attribute..? — T J, Jun 04 '14 at 13:43
Careful with that regexp: if you have something like `href=\"/test-2/tt123\"`, your value will be `2123` and not the expected `123` — Patrick Ferreira, Jun 04 '14 at 13:52
Don't use regex to parse HTML (or XHTML). See [bobince's answer here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) for *why*. Use a proper parser instead. — JonK, Jun 04 '14 at 14:07
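To illustrate the pitfall from the comment above, here is a minimal sketch that works on an href value that has already been pulled out of the markup (so the regex is applied to a plain string, not to HTML). The class name `HrefDigits` and the helper `trailingDigits` are made up for this example, and it assumes the number you want is the run of digits at the very end of the href.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class HrefDigits {

    // Matches the run of digits that ends the string, e.g. "123" in "/test-2/tt123".
    private static final Pattern TRAILING_DIGITS = Pattern.compile("(\\d+)$");

    /** Returns the trailing digit run of href, or empty if href does not end in digits. */
    static Optional<String> trailingDigits(String href) {
        Matcher m = TRAILING_DIGITS.matcher(href);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String href = "/test-2/tt123";
        // Stripping every non-digit glues unrelated digits together.
        System.out.println(href.replaceAll("[^\\d.]", ""));   // prints 2123
        // Anchoring on the end of the string keeps only the final run.
        System.out.println(trailingDigits(href).orElse(""));  // prints 123
    }
}
```

If the number can sit somewhere other than the end of the href, the pattern would need to change accordingly.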

score 1 · Answer 1 · answered Jun 04 '14 at 13:50

Jsoup is a good API for working with HTML. Using it, you could do something like this:

String html = "<tag blabla=\"title\"><a href=\"/test/tt123\"> TEST 1 <tag>";
Document doc = Jsoup.parseBodyFragment(html);
String value = doc.select("a").get(0).attr("href").replaceAll("[^\\d.]", "");
System.out.println(value);

score 0 · Answer 2 · answered Jun 04 '14 at 13:51

You could do this (a method that removes all duplicates in any number):

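// needs: import java.util.HashSet; import java.util.Set;
// str is assumed to contain only digits, e.g. "1231"
int[] foo = new int[str.length()];
for (int i = 0; i < str.length(); i++) {
    // charAt returns a char, so convert it to its numeric value explicitly
    foo[i] = Character.getNumericValue(str.charAt(i));
}

// a Set keeps each digit at most once
Set<Integer> set = new HashSet<Integer>();

for (int i = 0; i < foo.length; i++) {
    set.add(foo[i]);
}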

Now you have a set in which all duplicate digits from the string are removed. I only just saw your last comment, so this answer might not be very useful to you. Alternatively, you could take the first three digits of the foo array, which will give you 123.

Ceiling Gecko · Answer 3 · 2014-06-04T14:06:26.127

First use XPath to parse out only the href value, then apply your replaceAll to achieve what you desired.

And you don't have to download any additional frameworks or libraries for this to work.

Here's a quick demo class on how this works:

package com.example.test;

import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;


public class Test {


    public static void main(String[]args){

        String xml = "<tag blabla=\"title\"><a href=\"/test/tt123\"> TEST 1 </a></tag>";

        XPath xPath = XPathFactory.newInstance().newXPath();

        InputSource source = new InputSource(new StringReader(xml));

        String hrefValue = null;
        try {
            hrefValue = (String) xPath.evaluate("//@href", source, XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }

        String numbers = hrefValue.replaceAll("[^\\d.]", "");

        System.out.println(numbers);

    }

}
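As a possible variation on the demo above, here is a sketch that restricts the XPath to `a` elements, strips non-digits only from the last path segment of the href, and returns the result as an int. The names `HrefId` and `idFromFirstLink` are invented for this example. It assumes the input is well-formed XML (real-world HTML often is not) and that the digits fit into an int.

```java
import java.io.StringReader;
import java.util.OptionalInt;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
final class HrefId {

    /** Returns the digits of the last path segment of the first a/@href, if there are any. */
    static OptionalInt idFromFirstLink(String xml) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        String href;
        try {
            // With XPathConstants.STRING an empty match yields "", not null.
            href = (String) xPath.evaluate(
                    "//a/@href", new InputSource(new StringReader(xml)), XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            // Thrown, for example, when the input is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
        // Keep only the part after the last '/', so digits in earlier segments are ignored.
        String lastSegment = href.substring(href.lastIndexOf('/') + 1);
        String digits = lastSegment.replaceAll("\\D", "");
        // Assumption: the digit run is short enough for Integer.parseInt.
        return digits.isEmpty() ? OptionalInt.empty() : OptionalInt.of(Integer.parseInt(digits));
    }

    public static void main(String[] args) {
        String xml = "<tag blabla=\"title\"><a href=\"/test-2/tt123\"> TEST 1 </a></tag>";
        System.out.println(idFromFirstLink(xml).orElse(-1)); // prints 123
    }
}
```

Returning an `OptionalInt` avoids the `NullPointerException` that the original demo would hit if the XPath evaluation failed and `hrefValue` stayed null.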
