Match a newline using regular expressions in Java?

Question

I would like to extract HTML content using regular expressions. I have a problem when the content spans multiple lines: no matches are found. Here is the regular expression that I use:

String regExpContent = "<div class=\"views-field views-field-body\">(\\s+)<span class=\"field-content\">([\\:\\,\\w\\s\\.\\„\\”\\-\\(\\)0123456789(&nbsp;)(\r?\n)]+)</span>(\\s+)</div>";
Pattern regExpMatcherContent = Pattern.compile(regExpContent,
            Pattern.DOTALL | Pattern.UNICODE_CHARACTER_CLASS);

I use (\r?\n) to match a newline. Can anybody help me?

score 1 · Answer 1 · answered Jun 30 '13 at 06:16

Please use an HTML parser.

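import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<div class=\"views-field views-field-body\">...</div>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
// Select every span.field-content inside a div.views-field-body
Elements fieldContent = body.select("div.views-field-body span.field-content");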

The use of regex for parsing HTML has been discouraged so often that I won't repeat any of the arguments here. Suffice it to say that you really should not do it.

answered Jun 30 '13 at 06:16

Tomalak

Then you should use jSoup nevertheless and explain to your teacher that real developers don't parse HTML with regular expressions. – Tomalak Jun 30 '13 at 06:43

score 0 · Answer 2 · edited May 23 '17 at 10:32

The problem is that you are using regex to parse HTML. You should use an HTML parser.

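To answer your question:

Your Pattern.DOTALL is redundant because you do not use . anywhere in your regex. DOTALL only changes whether . matches line terminators.

\s in your regex already matches newlines, because by default it is equivalent to [ \t\n\x0B\f\r] (with Pattern.UNICODE_CHARACTER_CLASS it matches any Unicode whitespace).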
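The two points above can be checked with a small sketch. The class name WhitespaceNewlineDemo and the helper matches are made up for this illustration and are not part of the question's code.

```java
import java.util.regex.Pattern;

class WhitespaceNewlineDemo {
    // Hypothetical helper: true if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // \s matches \r and \n without any flag.
        System.out.println(matches("a\\s+b", 0, "a\r\n  b"));        // true
        // '.' does not match \n by default...
        System.out.println(matches("a.b", 0, "a\nb"));               // false
        // ...and DOTALL is the flag that changes that.
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));  // true
    }
}
```

So DOTALL only matters once the pattern contains a dot, and \s needs no help to cross line breaks.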

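The problem is with your [\\:\\,\\w\\s\\.\\„\\”\\-\\(\\)0123456789(&nbsp;)(\r?\n)]+. Inside a character class, (&nbsp;) and (\r?\n) are not groups; each of their characters is just another literal member of the class. It should be ([:,\\w\\s.„”()-]|&nbsp;)+, which keeps the single characters in the class and matches &nbsp; as a separate alternative.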
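As a rough sketch of how the corrected class could be plugged into the original pattern: the sample HTML, the class name FieldContentRegexDemo, and the helper extract are invented for this illustration. The typographic quotes are written as Unicode escapes only to keep the source ASCII-safe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldContentRegexDemo {
    // Single characters stay in the character class; the multi-character
    // entity &nbsp; is a separate alternative. U+201E and U+201D are the
    // low and high typographic quotes from the original pattern.
    static final Pattern CONTENT = Pattern.compile(
            "<div class=\"views-field views-field-body\">\\s+"
            + "<span class=\"field-content\">"
            + "((?:[:,\\w\\s.\u201E\u201D()-]|&nbsp;)+)"
            + "</span>\\s+</div>",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: returns the captured span text, or null if nothing matches.
    static String extract(String html) {
        Matcher m = CONTENT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample input whose content spans two lines.
        String html = "<div class=\"views-field views-field-body\">\n"
                + "  <span class=\"field-content\">First line,\r\nsecond&nbsp;line (2013).</span>\n"
                + "</div>";
        System.out.println(extract(html));
    }
}
```

This still breaks as soon as the content contains a character outside the class or any nested tag, which is the practical reason to prefer a parser.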
