-2

I am trying to extract some data from HTML. This is my code:

    public static void main(String[] args) {
        final String str = "<div class=\"b-vacancy-list-salary\">\n" +
                "            from 50 000\n" +
                "             to 70 000\n" +
                "             USD.\n" +
                "        </div>";
        System.out.println(Arrays.toString(getTagValues(str).toArray()));
    }


    static final String tag = "<div class=\"b-vacancy-list-salary\">\n";
    private static final Pattern TAG_REGEX = Pattern.compile(tag+"(.+?)</div>");

    private static List<String> getTagValues(final String str) {
        System.out.println(tag);
        final List<String> tagValues = new ArrayList<String>();
        final Matcher matcher = TAG_REGEX.matcher(str);
        while (matcher.find()) {
            tagValues.add(matcher.group(1));
        }
        return tagValues;
    }

It prints `[]` instead of the value. What's wrong?

Tony
  • It's usually a bad idea to parse HTML with regex - see [this question](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) –  Aug 23 '13 at 15:54
  • Which values do you want to get from the HTML? – David Hofmann Aug 23 '13 at 15:54
  • `from 50 000 to 70 000 USD` – Tony Aug 23 '13 at 15:56
  • @user2062950 by usually I assume you mean always for any reason. – Slater Victoroff Aug 23 '13 at 16:05
  • @SlaterTyranus pretty much, unless you have a small amount of HTML with well known structure (that isn't going to change, which is rarely the case). the question I linked has some good discussion on the subject –  Aug 23 '13 at 16:10
  • @user2062950 Heh, yea, linked to the same question. I would still argue that using a low upkeep xml parser like lxml is still the right choice though. – Slater Victoroff Aug 23 '13 at 17:34

5 Answers

1

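By default `.` does not match line terminators, so `(.+?)` cannot cross the newlines inside the `<div>`. You can remove the line feeds before matching (and drop the trailing `\n` from `tag`).

A better way to parse HTML is to use a DOM parser or XPath.

E.g.: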

    public static void main(String[] args) {
      final String str = "<div class=\"b-vacancy-list-salary\">\n"
              + "            from 50 000\n"
              + "             to 70 000\n"
              + "             USD.\n"
              + "        </div>";
      System.out.println(Arrays.toString(getTagValues(str).toArray()));
    }
    static final String tag = "<div class=\"b-vacancy-list-salary\">";
    private static final Pattern TAG_REGEX = Pattern.compile(tag + "(.+?)</div>");

    private static List<String> getTagValues(final String str) {
      System.out.println(tag);
      final List<String> tagValues = new ArrayList<String>();
      final Matcher matcher = TAG_REGEX.matcher(str.replace("\n", ""));
      while (matcher.find()) {
        tagValues.add(matcher.group(1).trim());
      }
      return tagValues;
    }
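The DOM/XPath route mentioned above could look roughly like the following. This is only a sketch, and it assumes the fragment is well-formed XML, because it uses the JDK's built-in XML parser; real-world HTML usually is not well-formed and would need a dedicated HTML parser (jsoup is one option). The class name `SalaryXPathSketch` and the method `salaryText` are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SalaryXPathSketch {

    // Hypothetical helper: returns the text of the first matching div,
    // with leading/trailing whitespace removed and inner runs collapsed.
    static String salaryText(String xml) {
        try {
            // Assumption: the input is well-formed XML, not arbitrary HTML.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // normalize-space() is a standard XPath 1.0 function.
            return XPathFactory.newInstance().newXPath().evaluate(
                    "normalize-space(//div[@class='b-vacancy-list-salary'])", doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        String str = "<div class=\"b-vacancy-list-salary\">\n"
                + "            from 50 000\n"
                + "             to 70 000\n"
                + "             USD.\n"
                + "        </div>";
        System.out.println(salaryText(str));
    }
}
```

Because the parser works on the document tree, the newlines inside the element no longer matter, and no regex is involved.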
Duffydake
1

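Instead of

    private static final Pattern TAG_REGEX = Pattern.compile(tag+"(.+?)</div>");

use

    private static final Pattern TAG_REGEX = Pattern.compile(tag+"([\\s\\S]+?)</div>");

`[\\s\\S]` matches any character, including newlines, which `.` does not match by default. (Inside a character class `|` is a literal pipe, so it is not needed there.)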
jyotesh
0

Try adding Pattern.DOTALL as the second parameter of Pattern.compile. This enables the dot in the pattern to match newlines. Not sure if this quite gives you what you want, but it may help you get started.

    private static final Pattern TAG_REGEX = Pattern.compile(tag + "(.+?)</div>",
                                                             Pattern.DOTALL);

See the Javadoc for `Pattern.DOTALL` for details.
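To see the effect of the flag in isolation, here is a small self-contained sketch that runs the same pattern with and without `Pattern.DOTALL`. The class name `DotallSketch` and the method `extract` are illustrative only, and `Pattern.quote` is an addition not in the original code (it just guards against regex metacharacters in the tag).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallSketch {

    static final String TAG = "<div class=\"b-vacancy-list-salary\">";

    // Hypothetical helper: collects the trimmed content of every matching div,
    // compiling the pattern with the given flags (0 means "no flags").
    static List<String> extract(String html, int flags) {
        Pattern pattern = Pattern.compile(Pattern.quote(TAG) + "(.+?)</div>", flags);
        Matcher matcher = pattern.matcher(html);
        List<String> values = new ArrayList<>();
        while (matcher.find()) {
            values.add(matcher.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        String str = TAG + "\n"
                + "            from 50 000\n"
                + "             to 70 000\n"
                + "             USD.\n"
                + "        </div>";
        // Without DOTALL the dot stops at the first newline, so nothing matches.
        System.out.println(extract(str, 0));
        // With DOTALL the dot also matches line terminators.
        System.out.println(extract(str, Pattern.DOTALL));
    }
}
```

Note that the captured value still contains the inner newlines and indentation; only the outer whitespace is trimmed here.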

ajb
0

`.` does not match newline characters by default. Try this (the lazy `*?` stops at the first `</div>` instead of the last one):

    Pattern.compile(tag + "((.|\n)*?)</div>");
Loki
0

You need to make the "." match newline characters. You can do this by putting "(?s)" at the front of your regular expression; so in your case, do Pattern.compile("(?s)" + tag + "(.+?)</div>");
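Since the asker wants the single-line value `from 50 000 to 70 000 USD`, a possible follow-up is to collapse the whitespace in the captured group. The sketch below combines the inline `(?s)` flag with `replaceAll("\\s+", " ")`; the class name `InlineFlagSketch` and the method `firstSalary` are hypothetical, and it assumes the opening tag appears exactly as written in the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineFlagSketch {

    // (?s) turns on DOTALL for the whole pattern, so "." also matches newlines.
    private static final Pattern SALARY = Pattern.compile(
            "(?s)<div class=\"b-vacancy-list-salary\">(.+?)</div>");

    // Hypothetical helper: returns the first salary text on a single line,
    // or an empty Optional when the div is not present.
    static Optional<String> firstSalary(String html) {
        Matcher matcher = SALARY.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Trim the ends, then collapse every whitespace run to one space.
        return Optional.of(matcher.group(1).trim().replaceAll("\\s+", " "));
    }

    public static void main(String[] args) {
        String str = "<div class=\"b-vacancy-list-salary\">\n"
                + "            from 50 000\n"
                + "             to 70 000\n"
                + "             USD.\n"
                + "        </div>";
        System.out.println(firstSalary(str).orElse("(no match)"));
    }
}
```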