
I have to replace all non-Latin-1 characters in the text of a large dataset file. Some example rows:

LABEL   chini vich 妈妈媽媽 maama
LABEL   南支那海 南シナ海 shabadik ar h ngkhani shina saagar
LABEL   ॐ आप्यायन्तु ममाङ्गानि वाक्प्राणश्चक्षुः

where the 2nd, tab-separated column holds the text in which non-Latin-1 characters must be found and replaced with a space. The characters I want to keep can be matched with a character class, for example

echo "chini vich 妈妈媽媽 maama" | sed "s/[[:alnum:]]*//g"
    妈妈媽

My aim is to do exactly the opposite:

echo "chini vich 妈妈媽媽 maama" | sed "s/(SOME REGEX)//g"
        chini vich maama

so replacing any occurrence of a non-Latin-1 character sequence with a single space. I have tried to negate the character class [:alnum:], i.e. [^A-Za-z0-9], but it does not work.

NOTE: Since the 1st column is alphanumeric and never contains non-Latin-1 characters, there should be no need to restrict the regex to the 2nd column, so I think it is fine to apply it to the whole row ($0 in awk). For rows that contain only non-Latin-1 characters, the regex will leave an empty 2nd column, as in the last row below:

LABEL   chini vich maama
LABEL   shabadik ar h ngkhani shina saagar
LABEL
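
One caveat with applying a negated `[:alnum:]` class to the whole row: the tab between the two columns is itself non-alphanumeric, so it would be turned into a space as well. A minimal sketch that edits only the 2nd column with `awk` could look like the following. The function name `clean_second_column` is hypothetical, `LC_ALL=C` is an assumption added so that the ranges are matched byte by byte, and only ASCII letters and digits are kept (punctuation and accented Latin-1 letters such as `é` would be replaced too).

```shell
# Hypothetical helper: clean only the 2nd tab-separated column.
# LC_ALL=C is an assumption so that A-Z, a-z and 0-9 mean ASCII bytes only.
clean_second_column() {
  LC_ALL=C awk -F'\t' -v OFS='\t' '{
    gsub(/[^A-Za-z0-9]+/, " ", $2)   # each run of other bytes becomes one space
    gsub(/^ | $/, "", $2)            # drop the single leading or trailing space
    print
  }'
}

printf 'LABEL\tchini vich 妈妈媽媽 maama\n' | clean_second_column
# prints: LABEL<TAB>chini vich maama
```

A row whose 2nd column contains nothing but non-ASCII text comes out as `LABEL` followed by a tab and an empty field.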

The similar question "Remove non-ASCII characters from CSV" is about removing non-ASCII characters; here we are dealing with the ISO 8859-1 (Latin-1) extension of ASCII. For more information, please refer to "What are the differences between ASCII, ISO 8859, and Unicode?"

loretoparisi
  • Maybe `[^[:alnum:]]*`? – revo Sep 25 '18 at 08:03
  • yup! the inversion is ok, but I need to keep char spacing, but this would not work `sed "s/[^[:alnum:]]*/ /g"` because it will put spaces among all chars. – loretoparisi Sep 25 '18 at 08:05
  • Okay, what if you try `sed -r "s/[^[:alnum:]]+/ /g"` (notice `+`)? – revo Sep 25 '18 at 08:14
  • @revo `sed -r "s/[^[:alnum:]]+/ /g"` works! On macOS the `-r` option (extended regex) is spelled `-E`, so it would be `echo "chini vich 妈妈媽媽 maama" | sed -E "s/[^[:alnum:]]+/ /g"`. Thank you. – loretoparisi Sep 25 '18 at 12:40
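
A minimal sketch of the approach from the comments above, wrapped in a function so it can be reused. The function name `strip_non_alnum` is made up for illustration, and pinning `LC_ALL=C` is an assumption that is not part of the thread: it makes `[:alnum:]` mean ASCII letters and digits only, so every byte of a multibyte UTF-8 character counts as non-alphanumeric whatever the system locale is. Note that this also replaces punctuation and accented Latin-1 letters such as `é`, so it is stricter than "non-Latin-1".

```shell
# Hypothetical helper: replace each run of characters that are not ASCII
# letters or digits with a single space. LC_ALL=C is an assumption added
# here so the result does not depend on the system locale.
strip_non_alnum() {
  LC_ALL=C sed -E 's/[^[:alnum:]]+/ /g'
}

printf '%s\n' 'chini vich 妈妈媽媽 maama' | strip_non_alnum
# prints: chini vich maama
```

The `+` is what keeps the word spacing intact: a whole run of unwanted bytes, including the spaces around it, collapses into one space instead of one space per match.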
