Getting some data from HTML using regex

Question

I was trying to extract some data from HTML. This is my code:

    public static void main(String[] args) {
        final String str = "<div class=\"b-vacancy-list-salary\">\n" +
                "            from 50 000\n" +
                "             to 70 000\n" +
                "             USD.\n" +
                "        </div>";
        System.out.println(Arrays.toString(getTagValues(str).toArray()));
    }


    static final String tag = "<div class=\"b-vacancy-list-salary\">\n";
    private static final Pattern TAG_REGEX = Pattern.compile(tag+"(.+?)</div>");

    private static List<String> getTagValues(final String str) {
        System.out.println(tag);
        final List<String> tagValues = new ArrayList<String>();
        final Matcher matcher = TAG_REGEX.matcher(str);
        while (matcher.find()) {
            tagValues.add(matcher.group(1));
        }
        return tagValues;
    }

It prints [] instead of the salary text. What's wrong?

It's usually a bad idea to parse HTML with regex - see [this question](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — , Aug 23 '13 at 15:54
@user2062950 by usually I assume you mean always for any reason. — Slater Victoroff, Aug 23 '13 at 16:05
@SlaterTyranus Pretty much, unless you have a small amount of HTML with a well-known structure (one that isn't going to change, which is rarely the case). The question I linked has some good discussion on the subject — , Aug 23 '13 at 16:10
@user2062950 Heh, yea, linked to the same question. I would still argue that using a low upkeep xml parser like lxml is still the right choice though. — Slater Victoroff, Aug 23 '13 at 17:34

score 1 · Answer 1 · answered Aug 23 '13 at 15:55

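You can strip the line feeds before matching. By default "." does not match line terminators, so "(.+?)" can never reach "</div>" across the newlines inside the div.

A better way to parse HTML is to use a DOM parser or XPath.

For example: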

    public static void main(String[] args) {
      final String str = "<div class=\"b-vacancy-list-salary\">\n"
              + "            from 50 000\n"
              + "             to 70 000\n"
              + "             USD.\n"
              + "        </div>";
      System.out.println(Arrays.toString(getTagValues(str).toArray()));
    }
    static final String tag = "<div class=\"b-vacancy-list-salary\">";
    private static final Pattern TAG_REGEX = Pattern.compile(tag + "(.+?)</div>");

    private static List<String> getTagValues(final String str) {
      System.out.println(tag);
      final List<String> tagValues = new ArrayList<String>();
      final Matcher matcher = TAG_REGEX.matcher(str.replace("\n", ""));
      while (matcher.find()) {
        tagValues.add(matcher.group(1).trim());
      }
      return tagValues;
    }
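The DOM/XPath route mentioned above can be sketched with the JDK's own javax.xml.xpath package. This is only a minimal illustration, and it assumes the input is well-formed XML, which this fragment happens to be and which real-world HTML often is not (a dedicated HTML parser would be needed there). The class name XPathSalaryDemo and the method name salary are made up for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathSalaryDemo {
    // Hypothetical helper: returns the text of the salary div on one line.
    // Assumes 'xml' is well-formed XML; malformed HTML will make it throw.
    static String salary(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // normalize-space() trims the text and collapses runs of whitespace.
            return xpath.evaluate(
                    "normalize-space(//div[@class='b-vacancy-list-salary'])",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<div class=\"b-vacancy-list-salary\">\n"
                + "            from 50 000\n"
                + "             to 70 000\n"
                + "             USD.\n"
                + "        </div>";
        System.out.println(salary(html)); // from 50 000 to 70 000 USD.
    }
}
```

Because the whitespace handling lives in the XPath expression, no newline stripping or trim() call is needed on the Java side.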

score 1 · Answer 2 · answered Aug 23 '13 at 16:02


Instead of

    private static final Pattern TAG_REGEX = Pattern.compile(tag+"(.+?)</div>");

use

    private static final Pattern TAG_REGEX = Pattern.compile(tag+"([\\s\\S]+?)</div>");

Answered by jyotesh.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Slater Victoroff Aug 23 '13 at 16:05
@Tony Because HTML isn't a regular grammar; it's more complicated than one, and even an extended regular-expression matcher can't handle its structure. You need a recursive-descent parser. – chrylis -cautiouslyoptimistic- Aug 23 '13 at 16:29

score 0 · Answer 3 · answered Aug 23 '13 at 16:06

Try adding Pattern.DOTALL as the second parameter of Pattern.compile. This enables the dot in the pattern to match newlines. Not sure if this quite gives you what you want, but it may help you get started.

    private static final Pattern TAG_REGEX = Pattern.compile(tag + "(.+?)</div>",
                                                             Pattern.DOTALL);

See the Javadoc for Pattern.DOTALL for details.
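To see the effect of the flag in isolation, here is a small self-contained sketch that runs the same pattern with and without Pattern.DOTALL. The class name DotAllDemo and the helper extract are invented for this illustration, and the whitespace collapsing is an extra step not in the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    static final String TAG = "<div class=\"b-vacancy-list-salary\">";

    // Default mode: '.' stops at line terminators, so a multi-line body never matches.
    static final Pattern DEFAULT_DOT = Pattern.compile(TAG + "(.+?)</div>");
    // DOTALL mode: '.' matches line terminators as well.
    static final Pattern DOTALL_DOT = Pattern.compile(TAG + "(.+?)</div>", Pattern.DOTALL);

    static List<String> extract(Pattern pattern, String html) {
        List<String> values = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            // Trim the captured text and collapse inner whitespace to single spaces.
            values.add(matcher.group(1).trim().replaceAll("\\s+", " "));
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<div class=\"b-vacancy-list-salary\">\n"
                + "            from 50 000\n"
                + "             to 70 000\n"
                + "             USD.\n"
                + "        </div>";
        System.out.println(extract(DEFAULT_DOT, html)); // []
        System.out.println(extract(DOTALL_DOT, html));  // [from 50 000 to 70 000 USD.]
    }
}
```

The lazy quantifier still matters with DOTALL enabled: a greedy ".+" would run to the last "</div>" in the input when several divs are present.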

score 0 · Answer 4 · answered Aug 23 '13 at 16:07


By default, "." does not match a newline. Try this:

    Pattern.compile(tag + "((.|\n)*?)</div>");

Answered by Loki.

score 0 · Answer 5 · answered Aug 23 '13 at 16:10


You need to make "." match newline characters. You can do this by putting "(?s)" at the front of your regular expression, so in your case: Pattern.compile("(?s)" + tag + "(.+?)</div>");

Answered by user2711693.
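A short sketch of the inline "(?s)" flag, which behaves like passing Pattern.DOTALL. It also shows why the closing "</div>" has to stay in the pattern: a lazy ".+?" with nothing after it is satisfied by a single character. The class name InlineFlagDemo, the helper firstGroup, and the shortened input string are all made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineFlagDemo {
    static final String TAG = "<div class=\"b-vacancy-list-salary\">";

    // Hypothetical helper: returns group 1 of the first match, or null if none.
    static String firstGroup(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = TAG + "\nfrom 50 000\nto 70 000\nUSD.\n</div>";

        // No terminator after the lazy group: it captures only the first
        // character after the tag, which here is the newline.
        String truncated = firstGroup("(?s)" + TAG + "(.+?)", html);
        System.out.println(truncated.length()); // 1

        // With "</div>" as the terminator, the group expands up to the closing tag.
        String full = firstGroup("(?s)" + TAG + "(.+?)</div>", html);
        System.out.println(full.trim());
    }
}
```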
