Problem: I have a bunch of .txt files written in Portuguese on a Windows machine using Notepad. Some of them seem to have been saved with the ANSI encoding. When I open these files in gedit on Ubuntu after converting them to UTF-8, some of them show boxes containing 008D (see screenshot).

Screenshot gedit

When I print the same file to the terminal using cat, head or more, I get the output below. Note that everything from "última vez" up to and including the strange character is missing from the terminal output.

Olá madrinha! Eu gostava de ir contigo passar férias na montanha, porque acho que vai ser divertido e há muito tempo que já não vou a tua casa e na últimaar a ter saudades. Madrinha, não sei porque és tão simpática comigo mas para mim, és a melhor madrinha do mundo inteiro.  Madrinha, ajudas-me sempre que preciso e estás sempre a apoiar-me por isso, quero ir a tua casa para te apoiar a ti.  Obrigada madrinha, por me apoiares.  Muito obrigado madrinha

When I open the same file using atom.io, everything looks as it should:

Screenshot atom

Questions: Most pressingly, how can I get rid of this character without opening every file and deleting it by hand? And secondly, what is this character, i.e., what should I google to solve similar problems?

1 Answer

Found the magic keywords: 'remove unicode string using sed'. This approach, from https://stackoverflow.com/a/8562661/1331521, does the trick:

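# Define the Unicode character you want to remove, encoded as UTF-8.
# In this case U+008D, which is the two bytes 0xC2 0x8D (octal 302 215).
# printf with octal escapes works in any POSIX shell and needs no Python.
CHARS=$(printf '\302\215')

# Then run sed on all files in the current directory.
# The literal byte sequence is matched (no bracket expression), so single
# bytes of other characters are left alone even in a non-UTF-8 locale.
sed -i "s/$CHARS//g" *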
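
Because sed -i rewrites every file in place, it may be safer to first list which files actually contain the character and to keep backups. The following is only a sketch under a few assumptions: GNU grep and sed (as shipped with Ubuntu), files ending in .txt, and file names without newlines. The helper names list_affected and strip_affected are made up for illustration.

```shell
#!/bin/sh
# U+008D encoded as UTF-8 is the two bytes 0xC2 0x8D (octal 302 215).
CHAR=$(printf '\302\215')

# Hypothetical helper: print the names of the .txt files in directory $1
# that contain U+008D.
list_affected() {
    # LC_ALL=C compares raw bytes; -l prints file names only;
    # -F treats the pattern as a fixed string, not a regex.
    LC_ALL=C grep -lF -- "$CHAR" "$1"/*.txt
}

# Hypothetical helper: remove U+008D from the affected files only,
# leaving a .bak copy of each original next to it.
strip_affected() {
    list_affected "$1" | while IFS= read -r f; do
        LC_ALL=C sed -i.bak "s/$CHAR//g" "$f"
    done
}
```

After sourcing the script, list_affected . shows the affected files in the current directory and strip_affected . cleans them. Once the results look right, the .bak copies can be deleted.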