2

Hey guys, I've been trying to parse HTML files to scrape text from them, and every so often I get some really weird characters like à€œ. I determined that it's the "smart quotes" or curly punctuation that is causing all of my problems, so my temporary fix has been to search for each of these characters and replace it with its corresponding HTML code individually. My question: is there a way to use one regular expression (or something else) to search through the string only once and replace what it needs to based on what is there? My solution right now looks like this:

line = line.replaceAll( "“", "&ldquo;" ).replaceAll( "”", "&rdquo;" );
line = line.replaceAll( "–", "&ndash;" ).replaceAll( "—", "&mdash;" );
line = line.replaceAll( "‘", "&lsquo;" ).replaceAll( "’", "&rsquo;" );

It just seems like there could be a better and possibly more efficient way of doing this. Any input is greatly appreciated.

Thanks,
-Brett

Brett
  • 1
    If you use UTF-8 as page encoding, you don't need any HTML entities at all. How about that? – Sean Patrick Floyd Sep 02 '10 at 04:42
  • @seanizer, you still need `&lt;`, `&gt;` and `&amp;` ;) (well, sometimes you can get by with literal characters if you don't care about validity, but it can cause problems). – eyelidlessness Sep 02 '10 at 05:51
  • Yup, but those are XML entities. I was talking about HTML entities – Sean Patrick Floyd Sep 02 '10 at 06:26
  • Okay smartass. They're also HTML entities. – eyelidlessness Sep 04 '10 at 07:45
  • Whoa, heating up on this thread. For the sake of breaking the argument, I was using HTMLEditorKit in the Java API to do my HTML parsing. I needed the regex patterns to find those multi-byte characters and replace them with their respective entities. I didn't communicate that very well, but oh well. – Brett Sep 04 '10 at 22:54

4 Answers

3

As stated by others, the recommended way to take care of those characters is to configure your encoding settings.

For comparison, here is a method to re-code UTF-8 sequences as HTML entities using regex:

import java.util.regex.*;

public class UTF8Fixer {
    static String fixUTF8Characters(String str) {
        // Pattern to match most UTF-8 sequences:
        Pattern utf8Pattern = Pattern.compile("[\\xC0-\\xDF][\\x80-\\xBF]{1}|[\\xE0-\\xEF][\\x80-\\xBF]{2}|[\\xF0-\\xF7][\\x80-\\xBF]{3}");

        Matcher utf8Matcher = utf8Pattern.matcher(str);
        StringBuffer buf = new StringBuffer();

        // Search for matches
        while (utf8Matcher.find()) {
            // Decode the character
            String encoded = utf8Matcher.group();
            int codePoint = encoded.codePointAt(0);
            if (codePoint >= 0xF0) {
                codePoint &= 0x07;
            }
            else if (codePoint >= 0xE0) {
                codePoint &= 0x0F;
            }
            else {
                codePoint &= 0x1F;
            }
            for (int i = 1; i < encoded.length(); i++) {
                codePoint = (codePoint << 6) | (encoded.codePointAt(i) & 0x3F);
            }
            // Recode it as an HTML entity
            encoded = String.format("&#%d;", codePoint);
            // Add it to the buffer
            utf8Matcher.appendReplacement(buf,encoded);
        }
        utf8Matcher.appendTail(buf);
        return buf.toString();
    }

    public static void main(String[] args) {
        String subject = "String with \u00E2\u0080\u0092strange\u00E2\u0080\u0093 characters";
        String result = UTF8Fixer.fixUTF8Characters(subject);
        System.out.printf("Subject: %s%n", subject);
        System.out.printf("Result: %s%n", result);
    }
}

Output:

Subject: String with “strange” characters
Result: String with &#8210;strange&#8211; characters
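
As a sanity check on the bit arithmetic in the loop above, here is a minimal sketch that decodes one three-byte sequence by hand. The class and method names (`Utf8ByteMath`, `decodeThreeByte`) are made up for this illustration and are not part of the code above.

```java
public class Utf8ByteMath {
    // Combine the payload bits of a three-byte UTF-8 sequence into one code point.
    // (Hypothetical helper, named only for this example.)
    static int decodeThreeByte(int lead, int cont1, int cont2) {
        int codePoint = lead & 0x0F;                    // lead byte 1110xxxx: keep low 4 bits
        codePoint = (codePoint << 6) | (cont1 & 0x3F);  // continuation 10xxxxxx: keep low 6 bits
        codePoint = (codePoint << 6) | (cont2 & 0x3F);
        return codePoint;
    }

    public static void main(String[] args) {
        int cp = decodeThreeByte(0xE2, 0x80, 0x93);
        System.out.println(cp);                       // prints 8211
        System.out.println(Integer.toHexString(cp));  // prints 2013
    }
}
```

The bytes E2 80 93 give 0x2013 (decimal 8211), which matches the second entity in the output above.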

Markus Jarderot
2

There's a huge thread over here that shows you why it is a bad idea to use regex to parse HTML.

Look for an external library to do this task. One example is JSoup. There's also a tutorial on their website that you can use.

Coding District
  • 2
    Converting certain characters to entities isn't parsing HTML with regex. – eyelidlessness Sep 02 '10 at 05:53
  • The regex was for the special multi-byte characters and not to parse my HTML, but thanks a ton for the JSoup reference--hands down, tons better than the Java API HTMLEditorKit. – Brett Sep 04 '10 at 22:56
2

Your file appears to be UTF-8 encoded, but you're reading it as though it were in a single-byte encoding like windows-1252. UTF-8 uses three bytes to encode each of those characters, but when you decode it as windows-1252, each byte is treated as a separate character.
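
The mismatch can be reproduced in a few lines. This is only a sketch: the class name `MojibakeDemo` and the helper `misdecode` are invented for the example, and it assumes the JVM provides the `windows-1252` charset (common, but not guaranteed by the Java specification).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text as UTF-8, then (wrongly) decode those bytes as windows-1252.
    // Assumption: the "windows-1252" charset is available on this JVM.
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u201C");  // U+201C, left double quotation mark
        System.out.println(garbled.length());  // prints 3: one char per UTF-8 byte
        StringBuilder hex = new StringBuilder();
        for (char c : garbled.toCharArray()) {
            hex.append(String.format("U+%04X ", (int) c));
        }
        System.out.println(hex.toString().trim());  // prints U+00E2 U+20AC U+0153
    }
}
```

The three resulting characters are â, € and œ, the typical look of a curly quote that was decoded with the wrong charset.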

When working with text, you should always specify an encoding if possible; don't let the system use its default encoding. In Java, that means using InputStreamReader and OutputStreamWriter instead of FileReader and FileWriter. Any reasonably good text editor should let you specify an encoding as well.
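
A minimal sketch of reading with an explicit encoding, assuming the input really is UTF-8. The class name `ExplicitEncodingDemo` and the helper `readUtf8` are made up for this example; an in-memory stream stands in for a file so the snippet is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingDemo {
    // Read the whole stream as UTF-8 instead of relying on the platform default.
    // (Hypothetical helper; wraps IOException so callers need no throws clause.)
    static String readUtf8(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The three UTF-8 bytes of U+201C (left double quotation mark).
        byte[] utf8Bytes = {(byte) 0xE2, (byte) 0x80, (byte) 0x9C};
        String text = readUtf8(new ByteArrayInputStream(utf8Bytes));
        System.out.println(text.length());          // prints 1
        System.out.println(text.equals("\u201C"));  // prints true
    }
}
```

For a real file, the same helper could be given a `FileInputStream`; writing works the same way with `OutputStreamWriter` and an explicit charset.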

As for your actual question, no, Java doesn't have a built-in facility for dynamic replacements (unlike most other regex flavors). But it's not too difficult to write your own, or even better, use one that someone else wrote. I posted one from Elliott Hughes in this answer.
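
For illustration, here is one way a hand-rolled single-pass replacement might look. It is not the Elliott Hughes code mentioned above, just a sketch built on `Matcher.appendReplacement`; the class name `EntityReplacer`, the method name, and the choice of a lookup map are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityReplacer {
    // Lookup table from curly punctuation to its named HTML entity.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("\u201C", "&ldquo;");
        ENTITIES.put("\u201D", "&rdquo;");
        ENTITIES.put("\u2013", "&ndash;");
        ENTITIES.put("\u2014", "&mdash;");
        ENTITIES.put("\u2018", "&lsquo;");
        ENTITIES.put("\u2019", "&rsquo;");
    }

    // One character class that matches any of the six characters.
    private static final Pattern CURLY =
            Pattern.compile("[\u201C\u201D\u2013\u2014\u2018\u2019]");

    // Scan the input once; choose each replacement based on what was matched.
    static String replaceCurlyPunctuation(String input) {
        Matcher m = CURLY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String entity = ENTITIES.get(m.group());
            // quoteReplacement guards against '$' or '\' in the replacement text.
            m.appendReplacement(out, Matcher.quoteReplacement(entity));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "\u201Chi\u201D \u2013 it\u2019s";
        System.out.println(replaceCurlyPunctuation(line));
        // prints &ldquo;hi&rdquo; &ndash; it&rsquo;s
    }
}
```

Compared with six chained calls, this walks the string once, and adding another character only means adding a map entry and extending the character class.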

One last thing: In your sample code you use replaceAll() to do the replacements, which is overkill and a possible source of bugs. Since you're matching literal text and not regexes, you should be using replace(CharSequence,CharSequence) instead. That way you never have to worry about accidentally including a regex metacharacter and going blooey.
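
A small sketch of the difference, using a dot as the metacharacter. The class and method names are invented for the example; only `String.replaceAll` and `String.replace` come from the Java API.

```java
public class ReplaceVsReplaceAll {
    // replaceAll treats its first argument as a regex, so "." matches any character.
    static String dotsToDashesRegex(String s) {
        return s.replaceAll(".", "-");
    }

    // replace(CharSequence, CharSequence) treats both arguments as literal text.
    static String dotsToDashesLiteral(String s) {
        return s.replace(".", "-");
    }

    public static void main(String[] args) {
        System.out.println(dotsToDashesRegex("a.b.c"));    // prints -----
        System.out.println(dotsToDashesLiteral("a.b.c"));  // prints a-b-c
    }
}
```

With the curly quotes from the question both methods happen to give the same result, because none of those characters is special in a regex; the literal version simply removes the risk.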

Alan Moore
  • That bit of advice went a long way last night. After a bit of digging on readers vs. input streams, I determined that it would be better if I backed off the input/output streams in favor of readers and writers. Thanks. – Brett Sep 04 '10 at 22:59
-1

Don't use regular expressions for HTML. Use a real parser.

This will also help you get around any character encodings you might encounter.

Thorbjørn Ravn Andersen
  • 1
    Converting certain characters to entities isn't parsing HTML with regex. – eyelidlessness Sep 02 '10 at 05:52
  • 2
    The "weird characters" looked like handling UTF-8 incorrectly. – Thorbjørn Ravn Andersen Sep 02 '10 at 06:21
  • @Thorbjørn, I realize that. That's still not parsing HTML. – eyelidlessness Sep 03 '10 at 02:17
  • @eyelidlessness, op clearly said: "I've been trying to parse through HTML files to scrape text from them." He IS trying to parse HTML using regex and because of this, he is having problems such as this. Who knows what other problems could arise when the recommended way is to use an external library. – Coding District Sep 04 '10 at 05:48
  • @Coding, if you read the code (which, as we all know, is more correct than its comments), you can see that the OP is replacing characters in text, not parsing HTML. They are parsing characters, which happen to be in an HTML document, but literally none of the HTML parsing rules apply to the question and its solution. What does an external library have to do with the fact that the question doesn't have anything to do with parsing HTML markup? – eyelidlessness Sep 04 '10 at 07:44
  • @eyelidlessness, the phrasing *the "smart quotes" ... is causing all my problems* is a dead give-away that the regexps are just about to break. – Thorbjørn Ravn Andersen Sep 05 '10 at 19:47
  • @Thorbjørn, why? Regex is perfectly capable of detecting single characters like curly quotes. Please explain to me how this has anything whatsoever to do with parsing HTML or any other irregular language. – eyelidlessness Sep 05 '10 at 23:01