
I would like to extract HTML content using regular expressions. The pattern fails when the content spans multiple lines: no matches are found. Here is the regular expression that I use:

String regExpContent = "<div class=\"views-field views-field-body\">(\\s+)<span class=\"field-content\">([\\:\\,\\w\\s\\.\\„\\”\\-\\(\\)0123456789(&nbsp;)(\r?\n)]+)</span>(\\s+)</div>";
Pattern regExpMatcherContent = Pattern.compile(regExpContent,
            Pattern.DOTALL | Pattern.UNICODE_CHARACTER_CLASS);

I use (\r?\n) to match a line break. Can anybody help me?

vikifor

2 Answers


Please use an HTML parser.

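import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<div class=\"views-field views-field-body\">...</div>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();

Elements fieldContent = body.select("div.views-field-body span.field-content");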

The use of regex for parsing HTML has been discouraged so often that I won't repeat any of the arguments here. Suffice it to say that you really should not do it.

Tomalak
    Then you should use jSoup nevertheless and explain to your teacher that real developers don't parse HTML with regular expressions. – Tomalak Jun 30 '13 at 06:43

The problem is that you are using regex to parse HTML. You should use an HTML parser.


To answer your question

Your Pattern.DOTALL flag is redundant because you are not using . anywhere in your regex.

\s in your regex already matches line breaks, because by default it is equivalent to [ \t\n\x0B\f\r].
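The two points above can be checked with a small sketch. The class and method names here (WhitespaceDemo, whitespaceOnly, dotMatches) are made up for illustration and are not part of the question's code.

```java
import java.util.regex.Pattern;

public class WhitespaceDemo {
    // True if the whole string consists of characters matched by \s.
    static boolean whitespaceOnly(String s) {
        return Pattern.matches("\\s+", s);
    }

    // True if a single "." matches the whole string under the given flags.
    static boolean dotMatches(String s, int flags) {
        return Pattern.compile(".", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        // \s covers \r and \n, so no separate (\r?\n) is needed.
        System.out.println(whitespaceOnly("\r\n\t "));           // true
        // DOTALL only changes what "." matches; it has no effect on \s.
        System.out.println(dotMatches("\n", 0));                 // false
        System.out.println(dotMatches("\n", Pattern.DOTALL));    // true
    }
}
```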

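The problem is with your character class [\\:\\,\\w\\s\\.\\„\\”\\-\\(\\)0123456789(&nbsp;)(\r?\n)]+. Inside square brackets, parentheses and ? are literal characters, so (&nbsp;) adds its individual characters to the set rather than matching the entity as a unit, and (\r?\n) is unnecessary because \s already covers line breaks. The digits are likewise already covered by \w. It should be ([:,\\w\\s.„”()-]|&nbsp;)+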
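As a rough sketch of how the suggested class could slot into the question's pattern: the class name FieldContentRegex, the extract helper, and the sample HTML below are all assumptions made for illustration. The quotation marks are written as Unicode escapes only to sidestep source-encoding issues, and Pattern.DOTALL is dropped since the pattern contains no dot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldContentRegex {
    // One allowed character OR the literal entity "&nbsp;", repeated.
    // \u201E and \u201D are the low and right double quotation marks.
    static final Pattern CONTENT = Pattern.compile(
        "<div class=\"views-field views-field-body\">\\s+"
        + "<span class=\"field-content\">"
        + "((?:[:,\\w\\s.\u201E\u201D()-]|&nbsp;)+)"
        + "</span>\\s+</div>",
        Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the span's content, or null if the markup does not match.
    static String extract(String html) {
        Matcher m = CONTENT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input whose content spans two lines.
        String html = "<div class=\"views-field views-field-body\">\n"
            + "  <span class=\"field-content\">First line,\r\n"
            + "second line&nbsp;(2013)</span>\n"
            + "</div>";
        System.out.println(extract(html));
    }
}
```

Any character outside the class, such as a nested tag, makes extract return null, which is one reason a parser remains the more robust choice.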

Anirudha