Replacing unicode punctuation with ASCII approximations

Question

I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sentences that are fed to OpenNLP. OpenNLP does not recognize Unicode characters and gives improper results on a number of symbols (it tokenizes "girl's" as "girl" and "'s", but if the apostrophe is a Unicode quote the word is treated as a single token).

For example, the source sentence may contain the Unicode directional quotation mark U+2018 (‘) and I would like to convert that to U+0027 ('). Eventually I will be stripping the remaining Unicode.

I understand that I am losing information, and I know that I could write regular expressions to convert each of these symbols, but I am asking if there is code I can reuse to convert some of these symbols.

This is what I could come up with, but I'm sure I will make mistakes, miss things, etc.:

    // double quotation (")
    replacements.add(new Replacement(Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));

    // single quotation (')
    replacements.add(new Replacement(Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));

replacements is a list of a custom Replacement class that I later iterate over to apply each replacement:

    for (Replacement replacement : replacements) {
        text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
    }
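
For reference, here is a minimal self-contained sketch of the approach above. The question does not show the Replacement class, so the nested `Replacement` class here is only an assumption about what it might look like, and `QuoteReplacer` and `toAsciiQuotes` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class QuoteReplacer {
    // Hypothetical holder for one pattern/replacement pair (not shown in the question).
    static final class Replacement {
        final Pattern pattern;
        final String replacement;

        Replacement(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    private static final List<Replacement> REPLACEMENTS = new ArrayList<>();

    static {
        // double quotation (")
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));
        // single quotation (')
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
    }

    // Applies every replacement in order and returns the rewritten text.
    static String toAsciiQuotes(String text) {
        for (Replacement replacement : REPLACEMENTS) {
            text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
        }
        return text;
    }
}
```

Compiling each pattern once and reusing it avoids recompiling the regex for every input string.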

As you can see, I had to find:

LEFT SINGLE QUOTATION MARK
RIGHT SINGLE QUOTATION MARK
SINGLE LOW-9 QUOTATION MARK (what is this/should I replace this?)
SINGLE HIGH-REVERSED-9 QUOTATION MARK (what is this/should I replace this?)

Are you looking for a library and/or example code in a particular language? Or are you looking for a pre-existing mapping of Unicode characters onto ASCII approximations? I'm not sure what the difference is between a regex and code you can reuse. — Mu Mind, Jan 26 '11 at 19:32
I am looking for a Java library. I can write regular expressions, but I'm sure I will miss something in the process. I was wondering if someone else has already made decisions for me. Have you been reading GEB, Mu Mind? — schmmd, Jan 26 '11 at 19:49

score 16 · Answer 1 · answered Jun 19 '11 at 19:28

I found a pretty extensive table that maps Unicode punctuation to their closest ASCII equivalents.

Here's more info: Map Symbols & Punctuation to ASCII.

Marek Stój

I translated that list to Scala and put it here: https://gist.github.com/dirkgr/6349f379740880209475 – Dirk Groeneveld Sep 02 '14 at 19:40
@schmmd has a more comprehensive version below. – Dirk Groeneveld Sep 02 '14 at 20:22

score 8 · Answer 2 · answered Jun 24 '13 at 16:16

I followed @marek-stoj's link and created a Scala application that cleans Unicode out of strings while maintaining the string length. It removes diacritics (accents) and uses the map suggested by @marek-stoj to convert non-ASCII Unicode characters to their ASCII approximations.

import java.text.Normalizer

object Asciifier {
  def apply(string: String) = {
    var cleaned = string
    for ((unicode, ascii) <- substitutions) {
      // literal replace: replaceAll would reject a lone "\" as the replacement string
      cleaned = cleaned.replace(unicode, ascii)
    }

    // convert diacritics to a two-character form (NFD)
    // http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
    cleaned = Normalizer.normalize(cleaned, Normalizer.Form.NFD)

    // remove all characters that combine with the previous character
    // to form a diacritic.  Also remove control characters.
    // http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
    // replaceAll returns a new string, so the result is assigned back
    cleaned = cleaned.replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{Cntrl}]", "")

    // size must not change
    require(cleaned.size == string.size)

    cleaned
  }

  val substitutions = Set(
      (0x00AB, '"'),
      (0x00AD, '-'),
      (0x00B4, '\''),
      (0x00BB, '"'),
      (0x00F7, '/'),
      (0x01C0, '|'),
      (0x01C3, '!'),
      (0x02B9, '\''),
      (0x02BA, '"'),
      (0x02BC, '\''),
      (0x02C4, '^'),
      (0x02C6, '^'),
      (0x02C8, '\''),
      (0x02CB, '`'),
      (0x02CD, '_'),
      (0x02DC, '~'),
      (0x0300, '`'),
      (0x0301, '\''),
      (0x0302, '^'),
      (0x0303, '~'),
      (0x030B, '"'),
      (0x030E, '"'),
      (0x0331, '_'),
      (0x0332, '_'),
      (0x0338, '/'),
      (0x0589, ':'),
      (0x05C0, '|'),
      (0x05C3, ':'),
      (0x066A, '%'),
      (0x066D, '*'),
      (0x200B, ' '),
      (0x2010, '-'),
      (0x2011, '-'),
      (0x2012, '-'),
      (0x2013, '-'),
      (0x2014, '-'),
      (0x2015, '-'),
      (0x2016, '|'),
      (0x2017, '_'),
      (0x2018, '\''),
      (0x2019, '\''),
      (0x201A, ','),
      (0x201B, '\''),
      (0x201C, '"'),
      (0x201D, '"'),
      (0x201E, '"'),
      (0x201F, '"'),
      (0x2032, '\''),
      (0x2033, '"'),
      (0x2034, '\''),
      (0x2035, '`'),
      (0x2036, '"'),
      (0x2037, '\''),
      (0x2038, '^'),
      (0x2039, '<'),
      (0x203A, '>'),
      (0x203D, '?'),
      (0x2044, '/'),
      (0x204E, '*'),
      (0x2052, '%'),
      (0x2053, '~'),
      (0x2060, ' '),
      (0x20E5, '\\'),
      (0x2212, '-'),
      (0x2215, '/'),
      (0x2216, '\\'),
      (0x2217, '*'),
      (0x2223, '|'),
      (0x2236, ':'),
      (0x223C, '~'),
      (0x2264, '<'),
      (0x2265, '>'),
      (0x2266, '<'),
      (0x2267, '>'),
      (0x2303, '^'),
      (0x2329, '<'),
      (0x232A, '>'),
      (0x266F, '#'),
      (0x2731, '*'),
      (0x2758, '|'),
      (0x2762, '!'),
      (0x27E6, '['),
      (0x27E8, '<'),
      (0x27E9, '>'),
      (0x2983, '{'),
      (0x2984, '}'),
      (0x3003, '"'),
      (0x3008, '<'),
      (0x3009, '>'),
      (0x301B, ']'),
      (0x301C, '~'),
      (0x301D, '"'),
      (0x301E, '"'),
      (0xFEFF, ' ')).map { case (unicode, ascii) => (unicode.toChar.toString, ascii.toString) }
}

You have a bug: `replaceAll` doesn't mutate string. You need to assign result of `replaceAll` back to cleaned. — slawek, Jun 24 '15 at 18:47

score 7 · Accepted Answer · answered Jan 26 '11 at 21:14

Each Unicode character is assigned a category. There are two separate categories for quotes:

Punctuation, Initial quote (Pi)
Punctuation, Final quote (Pf)

With these lists, you should be able to handle all quotes appropriately, if you would like to code the regex manually.

Java's Character.getType gives you the category of a character, for example FINAL_QUOTE_PUNCTUATION.

Now you can get the category of each (punctuation) character and replace it with an appropriate substitute in ASCII.

You can use the other punctuation categories accordingly. In 'Punctuation, Other' there are some characters, for example PRIME ′, which you may also want to substitute with an apostrophe.
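
As a rough sketch of the category-based idea, the following uses `Character.getType` to detect initial and final quote punctuation. The category alone does not say whether a quote is single or double, so this sketch looks for the word "SINGLE" in the name returned by `Character.getName`; that heuristic, and the names `QuoteCategories`, `isDirectionalQuote`, and `normalizeQuotes`, are assumptions for illustration, not part of the answer.

```java
class QuoteCategories {
    // True for code points in the Unicode categories Pi or Pf.
    static boolean isDirectionalQuote(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.INITIAL_QUOTE_PUNCTUATION
                || type == Character.FINAL_QUOTE_PUNCTUATION;
    }

    // Replaces Pi/Pf quotes with ' or " and copies everything else unchanged.
    static String normalizeQuotes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (isDirectionalQuote(cp)) {
                // Assumed heuristic: names such as "LEFT SINGLE QUOTATION MARK"
                // map to an apostrophe, all other quotes to a double quote.
                String name = Character.getName(cp);
                boolean single = name != null && name.contains("SINGLE");
                out.append(single ? '\'' : '"');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

Note that the low-9 quotes (U+201A, U+201E) are categorized as open punctuation rather than quote punctuation, so they pass through this sketch unchanged and would still need an explicit mapping.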

Michael Konietzka

I'm resorting to just using a custom map, with as many characters as I can define, because the Unicode categories assigned to basic characters seem inadequate. For example, the basic single and double quote characters (the ones you type into notepad using your keyboard for example) are categorized as "Punctuation Other", rather than the Punctuation Initial and Punctuation Final categories that you'd expect them to be categorized under. – Triynko Apr 14 '11 at 17:55
@Triynko - the problem there is: there is only one "normal" (ASCII) single quote and one double quote, so marking it as either `INITIAL` or `FINAL` quote punctuation would also be wrong. – Stephen P Apr 14 '11 at 19:31

score 3 · Answer 4 · answered Jul 25 '12 at 18:26

Here's a Python package that does a good job. It's based on the Perl module Text::Unidecode. I assume this could be ported to Java.

http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

http://pypi.python.org/pypi/Unidecode

score 3 · Answer 5 · answered Jan 27 '11 at 17:25

While this does not exactly answer your question, you can convert your Unicode text to US-ASCII, replacing non-ASCII characters with '?' symbols.

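import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

String input = "aáeéiíoóuú"; // 10 chars.

Charset ch = Charset.forName("US-ASCII");
CharsetEncoder enc = ch.newEncoder();
// replace every unmappable character with '?' instead of throwing
enc.onUnmappableCharacter(CodingErrorAction.REPLACE);
enc.replaceWith(new byte[]{'?'});

ByteBuffer out = null;

try {
    out = enc.encode(CharBuffer.wrap(input));
} catch (CharacterCodingException e) {
    /* ignored, shouldn't happen */
}

String outStr = ch.decode(out).toString();

// Prints "a?e?i?o?u?"
System.out.println(outStr);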

vz0

I remove diacritics with Normalizer.normalize(text, Normalizer.Form.NFD) followed by a replace with Pattern.compile("\\p{InCombiningDiacriticalMarks}+"). – schmmd Jan 27 '11 at 21:42
With this solution, basic punctuation marks like quotes that ought to be mapped are not mapped to the ASCII quote. Many other Unicode characters that you would say "this is basically the same thing as this ASCII character" will not get mapped properly. Therefore, I think that using a custom map with all reasonable replacements would achieve better results. – Triynko Apr 14 '11 at 18:01
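
The diacritic-stripping approach mentioned in the comments (NFD normalization followed by removing combining marks) could be sketched as follows. The class and method names are hypothetical. As noted above, this only handles accents; punctuation such as directional quotes is left untouched.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class DiacriticStripper {
    // Matches runs of combining marks from the Combining Diacritical Marks block.
    private static final Pattern COMBINING_MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    static String stripDiacritics(String text) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Dropping the combining marks leaves only the base letters.
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }
}
```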

score 2 · Answer 6 · answered Jan 27 '11 at 17:38

What I've done for similar substitutions is create a Map (usually HashMap) with the Unicode characters as the keys and their substitute as the values.

Pseudo-Java; the for depends on what sort of character container you're using as a parameter to the method that does this, e.g. String, CharSequence, etc.

StringBuilder output = new StringBuilder();
for (each Character 'c' in inputString)
{
    Character replacement = xlateMap.get( c );
    output.append( replacement != null ? replacement : c );
}
return output.toString();

Anything in the Map is replaced, anything not in the Map is unchanged and copied to output.
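
A runnable version of the pseudo-code above might look like the sketch below, assuming the input is a CharSequence. The class name `CharTranslator`, the method name `translate`, and the handful of map entries are illustrative choices only; a real map would hold many more characters.

```java
import java.util.HashMap;
import java.util.Map;

class CharTranslator {
    // Illustrative subset of a translation map.
    private static final Map<Character, Character> XLATE_MAP = new HashMap<>();

    static {
        XLATE_MAP.put('\u2018', '\''); // left single quotation mark
        XLATE_MAP.put('\u2019', '\''); // right single quotation mark
        XLATE_MAP.put('\u201c', '"');  // left double quotation mark
        XLATE_MAP.put('\u201d', '"');  // right double quotation mark
        XLATE_MAP.put('\u2013', '-');  // en dash
        XLATE_MAP.put('\u2014', '-');  // em dash
    }

    static String translate(CharSequence input) {
        StringBuilder output = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Character replacement = XLATE_MAP.get(c);
            // Mapped characters are replaced; everything else is copied as-is.
            output.append(replacement != null ? replacement : c);
        }
        return output.toString();
    }
}
```

Because the map is char-to-char, the output keeps the input length, but one-to-many substitutions (for example an ellipsis to three periods) would need a Map with String values instead.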

score 1 · Answer 7 · answered Aug 06 '21 at 12:41

String lstring = "my string containing all different symbols";

lstring = lstring.replaceAll("\u2013", "-")
    .replaceAll("\u2014", "-")
    .replaceAll("\u2015", "-")
    .replaceAll("\u2017", "_")
    .replaceAll("\u2018", "\'")
    .replaceAll("\u2019", "\'")
    .replaceAll("\u201a", ",")
    .replaceAll("\u201b", "\'")
    .replaceAll("\u201c", "\"")
    .replaceAll("\u201d", "\"")
    .replaceAll("\u201e", "\"")
    .replaceAll("\u2026", "...")
    .replaceAll("\u2032", "\'")
    .replaceAll("\u2033", "\"");
