I have to replace all non-Latin-1 characters in the text of a large dataset file. For example:
LABEL chini vich 妈妈媽媽 maama
LABEL 南支那海 南シナ海 shabadik ar h ngkhani shina saagar
LABEL ॐ आप्यायन्तु ममाङ्गानि वाक्प्राणश्चक्षुः
where the 2nd, tab-separated column holds the text in which those characters must be found and replaced with a space. A regex that matches Latin-1 characters can be built from character classes, for example:
echo "chini vich 妈妈媽媽 maama" | sed "s/[[:alnum:]]*//g"
妈妈媽
My aim is to do exactly the opposite:
echo "chini vich 妈妈媽媽 maama" | sed "s/(SOME REGEX)//g"
chini vich maama
that is, replacing every occurrence of a sequence of non-Latin-1 characters with a space \s.
I have tried to negate the character class [:alnum:], i.e. [^A-Za-z0-9], but it does not work.
NOTE
Since the 1st column is alphanumeric and never contains non-Latin-1 characters, there should be no need to restrict the regex to the 2nd column, so I think it is fine to apply it to the whole row; in awk that would be $(0).
For rows that contain only non-Latin-1 characters, as in the example below, the regex will result in an empty 2nd column:
LABEL chini vich maama
LABEL shabadik ar h ngkhani shina saagar
LABEL
The similar question Remove non-ASCII characters from CSV is about removing non-ASCII characters; here we are dealing with the ISO 8859 ASCII extension. For more info, please refer to What are the differences between ASCII, ISO 8859, and Unicode?