8

Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?

I need to extract paragraphs (such as the question title on a Stack Overflow page) from an HTML file.

I can use regular expressions in Java to extract the fields I need, but I then have to decode the extracted fields.

EXAMPLE

field extracted:

`Paging Lucene&#39;s search results`

field after decoding:

Paging Lucene's search results

Is there any class in Java that will let me decode these HTML entities?

Community
user
  • Does your HTML contain tags? – Mike Samuel Dec 06 '12 at 18:43
  • Yes, but the field extracted doesn't contain tags – user Dec 06 '12 at 18:44
  • 5
    For starters, [using regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) is utterly wrong in the first place. Just use an [HTML parser](http://stackoverflow.com/questions/3152138/what-are-the-pros-and-cons-of-the-leading-java-html-parsers) like Jsoup. Any decent one will unescape the HTML entities for you. – BalusC Dec 06 '12 at 18:47

2 Answers

31

Use methods provided by Apache Commons Lang

import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
Manish Singh
jlordo
  • 1
    https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringEscapeUtils.html#unescapeHtml(java.lang.String) - Latest link – useranon Feb 22 '17 at 09:24
3

Do not try to solve everything by regexp.

While you can handle some parts with regexps, such as replacing entities, the much better approach is to use a robust HTML parser.

See the question [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) for why the regexp Swiss Army chainsaw is a bad tool for this job. Seriously, read that question and its top answer; it is a Stack Overflow highlight!

Chuck Norris can parse HTML with regex.

The bad news is: there is more than one way to encode characters.

https://en.wikipedia.org/wiki/Character_encodings_in_HTML

For example, the character 'λ' can be represented as `&lambda;`, `&#955;` or `&#x3BB;`.
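To make the point concrete, here is a minimal JDK-only sketch showing that `&lambda;`, `&#955;` and `&#x3BB;` all have to decode to the same character. The class name `EntityForms`, the method `decode` and the tiny `NAMED` table are hypothetical, chosen only for this illustration, and the sketch assumes Java 9 or later. It is deliberately incomplete (the real HTML entity table has far more entries), which is exactly why a dedicated library is the better choice.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityForms {
    // Deliberately tiny table of named entities (illustration only, not the full HTML list).
    private static final Map<String, Integer> NAMED = Map.of(
            "lambda", 0x3BB, "amp", 0x26, "lt", 0x3C, "gt", 0x3E, "quot", 0x22);

    // Group 1: hex digits (&#x3BB;), group 2: decimal digits (&#955;), group 3: a name (&lambda;).
    private static final Pattern REF = Pattern.compile(
            "&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));");

    static String decode(String in) {
        Matcher m = REF.matcher(in);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer cp;
            try {
                if (m.group(1) != null) {
                    cp = Integer.parseInt(m.group(1), 16);
                } else if (m.group(2) != null) {
                    cp = Integer.parseInt(m.group(2), 10);
                } else {
                    cp = NAMED.get(m.group(3)); // null if the name is not in our tiny table
                }
            } catch (NumberFormatException e) {
                cp = null; // number too large to be a code point
            }
            // Unknown or invalid references are left untouched.
            String repl = (cp != null && Character.isValidCodePoint(cp))
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&lambda; &#955; &#x3BB;"));
        System.out.println(decode("Paging Lucene&#39;s search results"));
    }
}
```

Note that a reference without the terminating semicolon, or a named entity missing from the table, passes through unchanged here, whereas browsers and real parsers apply additional recovery rules.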

And if you are really unlucky, some web sites rely on browsers' ability to guess character meanings. `&#153;`, for example, is not valid, yet many browsers will interpret it as ™.
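The usual explanation for this behaviour is that browsers treat numeric references in the range 128 to 159 as if they were windows-1252 bytes rather than Unicode code points. The sketch below imitates that with the JDK's `windows-1252` charset; the class `C1Remap` and method `remap` are hypothetical names for this illustration, and it only approximates what a real HTML parser does.

```java
import java.nio.charset.Charset;

class C1Remap {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Maps a numeric character reference value to the code point a lenient
    // browser would likely display. Values outside 0x80..0x9F are returned unchanged.
    static int remap(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x9F) {
            return codePoint;
        }
        // Reinterpret the value as a single windows-1252 byte.
        String s = new String(new byte[] {(byte) codePoint}, CP1252);
        int mapped = s.codePointAt(0);
        // Bytes with no windows-1252 assignment may decode to U+FFFD; keep the original then.
        return mapped == 0xFFFD ? codePoint : mapped;
    }

    public static void main(String[] args) {
        // &#153; becomes the trade mark sign U+2122.
        System.out.println(new String(Character.toChars(remap(153))));
    }
}
```

A decoder that skips this step will turn `&#153;` into an invisible control character instead of ™, which is one more quirk a dedicated library can be expected to handle for you.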

Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.

So I strongly recommend:

  • Feed string into a robust HTML parser
  • Get parsed (and fully decoded) string back
Community
Has QUIT--Anony-Mousse
  • 1
    I need to extract from HTML files that share the same structure and tags (like Wikipedia), so I think regex is a good approach. – user Dec 06 '12 at 19:19
  • 3
    @MrCarAsus: NO IT IS NOT. Use an HTML parser, and DOM for extraction. That is what they are for! – Has QUIT--Anony-Mousse Dec 06 '12 at 19:20
  • Try using DBPedia, btw. It is an already parsed version of Wikipedia. – Has QUIT--Anony-Mousse Dec 06 '12 at 19:21
  • And do you know of a parsed version of StackOverflow? I tried using regex on StackOverflow HTML pages and it works. I extract the title and answers with a set of regexps applied to the HTML. – user Dec 06 '12 at 19:32
  • Use an HTML parser. Every time you rape HTML with a regexp parsing attempt, god kills a kitten. – Has QUIT--Anony-Mousse Dec 06 '12 at 19:42
  • Seriously, read [using Regexp to parse HTML is wrong](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). HTML is a Chomsky Type 2 language, and Regexp is of type 3. You need a Type 2 parser. – Has QUIT--Anony-Mousse Dec 06 '12 at 19:44
  • Plus **there are plenty of HTML parsers around**. Why don't you just try using them? The StackOverflow data dump is also quite well pre-parsed, btw. - you can get a lot of information out of it with a simple XML pull parser, and not having to do anything yourself. – Has QUIT--Anony-Mousse Dec 06 '12 at 19:45
  • Re "`&#153;` for example is not valid": isn't it perfectly valid, though possibly interpreted inconsistently by user agents? [Section 4.6 of HTML 5](http://www.w3.org/TR/html-markup/syntax.html#dec-charref) puts no bounds on the codepoints that can be represented by decimal numeric character references, and that codepoint is a [valid control character codepoint](http://www.unicode.org/charts/PDF/U0080.pdf). – Mike Samuel Dec 07 '12 at 01:26
  • 1
    @MikeSamuel The page says in number 3: "**not** ... in the range U+0080–U+009F". 0x0099 is in this range. – Has QUIT--Anony-Mousse Dec 07 '12 at 08:57