I have a huge file that contains a lot of illegal characters like the ones in the image below, but those are not all of them. There are many different kinds, so it's not possible to search for each one and replace it. Is there a way I can remove these characters? I've tried several solutions, such as converting the file to ANSI and using regular expressions, but none of them worked. Please help.

EDIT: If anyone can tell me how to remove these characters in Java, that would be fine too.

These are just a few of the characters; there are many more of different kinds.

Syed Muhammad Oan

1 Answer

Instead of removing specific characters, it's easier to implement a whitelist filter, provided you know which types of characters you expect to keep.

As per this answer, which explains how to remove emoticons, you can try:

String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter, "");
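Since the file is huge, it may help to apply the same filter while streaming line by line instead of loading everything into one String. Here is a minimal sketch, assuming the file is meant to be UTF-8; the class name WhitelistFilter and the methods clean and cleanFile are hypothetical names for illustration. Reading through an InputStreamReader means malformed bytes are decoded as U+FFFD, which the filter then drops because it is in the symbol category.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class WhitelistFilter {
    // Same character classes as the filter above, compiled once for reuse.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Removes every character that is not in the whitelist.
    static String clean(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    // Streams the input one line at a time so the whole file is never in memory.
    // Assumption: both files are UTF-8; change the charset if yours differs.
    static void cleanFile(Path in, Path out) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(in), StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(Files.newOutputStream(out), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(clean(line));
                writer.newLine(); // line endings are normalized to the platform default
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            cleanFile(Paths.get(args[0]), Paths.get(args[1]));
        } else {
            // Small demo: the smiley (a symbol) and U+0001 (a control character) are removed.
            System.out.println(clean("caf\u00e9 \u263a 42 \u0001!"));
        }
    }
}
```

Keep in mind that this whitelist has no symbol categories, so ASCII symbols such as $, + and = are removed as well. If you need them, add the matching class (for example \\p{Sc} or \\p{Sm}) to the character filter.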

To see which \p{} classes are available, look at the "Classes for Unicode scripts, blocks, categories and binary properties" section of the java.util.regex.Pattern docs:

\p{IsLatin} A Latin script character (script)

\p{InGreek} A character in the Greek block (block)

\p{Lu} An uppercase letter (category)

\p{IsAlphabetic} An alphabetic character (binary property)

\p{Sc} A currency symbol

\P{InGreek} Any character except one in the Greek block (negation)

[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)
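To get a feel for these classes before building your own whitelist, you can test single characters against them. A small sketch follows; UnicodeClassDemo and matchingClasses are hypothetical names, and the class list is just the samples quoted above written as Java string literals, so each backslash is doubled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class for trying out the sample classes listed above.
class UnicodeClassDemo {
    static final String[] CLASSES = {
        "\\p{IsLatin}", "\\p{InGreek}", "\\p{Lu}", "\\p{IsAlphabetic}",
        "\\p{Sc}", "\\P{InGreek}", "[\\p{L}&&[^\\p{Lu}]]"
    };

    // Returns the sample classes that match the given single-character string.
    static List<String> matchingClasses(String ch) {
        List<String> hits = new ArrayList<>();
        for (String cls : CLASSES) {
            if (ch.matches(cls)) {
                hits.add(cls);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Lowercase Latin, uppercase Latin, Greek small alpha, dollar sign.
        for (String ch : new String[] {"a", "A", "\u03b1", "$"}) {
            System.out.println(ch + " -> " + matchingClasses(ch));
        }
    }
}
```

For instance, the dollar sign matches only \p{Sc} and \P{InGreek}, which is why a whitelist without a symbol category strips it.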

Karol Dowbecki