3

I am processing a text corpus. It contains characters from several different languages, as well as symbols, numbers, etc.

-> All I need to do is skip symbols such as arrows, heart symbols, etc.

-> I must not damage any characters belonging to the different languages.

Any leads?

----UPDATE----

`Character.isLetter('\unicode')` works for most characters, but not all. I checked it against my regional languages: it works for some characters, but not for every one.

Thanks.

Firefox
  • This problem is not very well specified. Are you familiar with how the Unicode general categories work? Each code point belongs to exactly one of Letter, Number, Symbol, Punctuation, Mark, Separator, or Other (usually control characters). There are subdivisions within each of those. – tchrist Feb 02 '11 at 14:24
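
To make the general-category idea concrete, here is a minimal sketch that drops every code point whose category is one of the Symbol subdivisions (Sm, Sc, Sk, So) and keeps everything else, including letters and combining marks. The class and method names (`SymbolStripper`, `isSymbol`, `stripSymbols`) are made up for illustration, and it assumes Java 8 or later for `String.codePoints()`. It may also explain the `Character.isLetter` results: vowel signs and other combining marks belong to the Mark categories, not Letter, so `isLetter` returns false for them.

```java
final class SymbolStripper {
    private SymbolStripper() {}

    /** True for code points in the Unicode Symbol categories Sm, Sc, Sk, So. */
    static boolean isSymbol(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.MATH_SYMBOL:      // Sm, e.g. U+2192 RIGHTWARDS ARROW
            case Character.CURRENCY_SYMBOL:  // Sc, e.g. '$'
            case Character.MODIFIER_SYMBOL:  // Sk, e.g. '^'
            case Character.OTHER_SYMBOL:     // So, e.g. U+2665 BLACK HEART SUIT
                return true;
            default:
                return false;
        }
    }

    /** Returns a copy of text with all Symbol code points removed. */
    static String stripSymbols(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Iterate by code point so supplementary characters are handled as a unit.
        text.codePoints()
            .filter(cp -> !isSymbol(cp))
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Calling `SymbolStripper.stripSymbols(text)` leaves letters, digits, marks, punctuation and whitespace alone. Note that ASCII characters such as `+`, `<`, `$` and `^` are also in the Symbol categories, so they would be removed too unless you special-case them.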

2 Answers

1

If I understand correctly, the characters you want to remove come from a rather limited set. Why not just check for these? Unicode has a whole bunch of non-letter characters, but in your case, the non-letter characters you encounter will probably be a small subset of what exists.

Sounds like a job for regular expressions, if you ask me. Remove everything that's not a word character, digit or whitespace, and you've probably got it. Or create an array containing all characters you want filtered out (which in that case should be few and known).
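
A rough sketch of the regular-expression route, under the assumption that you are on Java 7 or later, where `Pattern.UNICODE_CHARACTER_CLASS` makes `\w` and `\s` follow the Unicode definitions instead of ASCII only. The class and method names here are hypothetical. Be aware that this removes punctuation as well as symbols, since neither is a word character or whitespace.

```java
import java.util.regex.Pattern;

final class WordFilter {
    private WordFilter() {}

    // Matches runs of anything that is neither a Unicode word character nor whitespace.
    // With UNICODE_CHARACTER_CLASS, \w covers letters, marks, digits and connector punctuation.
    private static final Pattern NOT_WORD_OR_SPACE =
            Pattern.compile("[^\\w\\s]+", Pattern.UNICODE_CHARACTER_CLASS);

    /** Removes everything except word characters and whitespace. */
    static String keepWordsAndSpaces(String text) {
        return NOT_WORD_OR_SPACE.matcher(text).replaceAll("");
    }
}
```

If punctuation should survive, a narrower pattern such as `\p{S}` (the Symbol general category) may be a better fit than the complement of `\w` and `\s`.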

Arne
  • The problem is that Java’s patterns [do not yet fully support the Unicode properties](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261) you’d need for this. For example, it’s missing a ‘word’ character. You may emulate that with `[\pL\pN\pM\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]`. Digits are `\p{Nd}`, so already in `\pN`. Java does not support the Unicode White_Space property, but you can precisely emulate it with `[\u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]`. – tchrist Feb 02 '11 at 14:30
0

You could implement a Charset that contains only the characters you want. You can then provide a CharsetDecoder to decode the text and strip out the characters you want to skip.

Qwerky
  • Thanks for the reply, Qwerky. That is feasible only if the set of characters is fixed and known in advance. Otherwise I would have to collect the characters of every existing language. What I was hoping to find is a library, or some property such as all these symbols belonging to one particular charset that I could tell it to skip, or any other solution. – Firefox Feb 02 '11 at 13:16
  • @Firefox: Do you understand how the Unicode general category properties work? Also, with JDK7 you *finally* have access to the script properties, which would allow you (for example) to detect that something is neither Script=Common, Script=Latin, nor Script=Greek. – tchrist Feb 02 '11 at 14:21
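
For reference, the script check mentioned in the last comment could look roughly like this on JDK7 or later, using `Character.UnicodeScript`. The class name `ScriptCheck` and the choice of the three scripts are only an illustration taken from the comment's example, not a recommendation.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

final class ScriptCheck {
    private ScriptCheck() {}

    // Example set from the comment: Common, Latin and Greek.
    private static final Set<UnicodeScript> KNOWN =
            EnumSet.of(UnicodeScript.COMMON, UnicodeScript.LATIN, UnicodeScript.GREEK);

    /** True if the code point's script is neither Common, Latin, nor Greek. */
    static boolean isOutsideKnownScripts(int codePoint) {
        return !KNOWN.contains(UnicodeScript.of(codePoint));
    }
}
```

One caveat: arrows and heart symbols have Script=Common, so a script test on its own will not pick them out. It is more useful for telling languages apart, combined with a general-category test for the symbols themselves.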