
In my file, the character Â is somehow getting added. I am not sure what it is or how it gets there.

12345A 210 CBCDEM

I want to remove this character from the file. I tried a basic sed command to remove it, but without success.

  sed -i -e 's/\Â//g'

I also read that dos2unix would do the job, but that didn't work either. Assuming it was a hex character, I also tried to remove it by its hex value with `sed -i 's/\xc2//g'`, but that didn't work either.

I really want to understand what this character is and how it is getting added. Moreover, is there a way to delete all such characters in a file?

Encoding details:

file test.txt 
test.txt: ISO-8859 text
echo $LANG
en_US.UTF-8

OS details:

uname -a
Linux vm-testmachine-001 3.10.0-693.11.1.el7.x86_64 #1 SMP Fri Oct 27 05:39:05 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

Regards.

Mad Physicist
user2854333
  • How is the file created? – Mad Physicist Jan 08 '18 at 18:12
  • The file is created as an export from Mongo. Mongo doesn't have such characters. – user2854333 Jan 08 '18 at 18:13
  • Sure. Can you check the encoding your system uses vs the encoding your database uses? – Mad Physicist Jan 08 '18 at 18:14
  • This should be useful I guess [How to remove all of the diacritics from a file?](https://stackoverflow.com/questions/10207354/how-to-remove-all-of-the-diacritics-from-a-file) – Inian Jan 08 '18 at 18:17
  • Try running `file <filename>` and include the result in your question. Something like `od -c <filename> | head` might also be useful. We need more information. – ghoti Jan 08 '18 at 18:20
  • @Mad Physicist My Unix machine's encoding is en_US.UTF-8. I assume Mongo should also be UTF-8 by default. – user2854333 Jan 08 '18 at 18:21
  • How do you see the Â characters? A graphical text editor? `cat`? – Mad Physicist Jan 08 '18 at 18:28
  • Oh wait, the encoding of the file is very much not UTF-8. There's your problem. – Mad Physicist Jan 08 '18 at 18:29
  • `iconv -f ISO-8859 -t en_US.UTF-8 test.txt -o test.txt` also fails, with the error: iconv: conversions from `ISO-8859' and to `en_US.UTF-8' are not supported – user2854333 Jan 08 '18 at 18:32
  • just plain `utf8` or `utf-8`. `en_US` is a locale. It specifies the digit separators, currency symbol, date format, etc that you use. The character encoding is UTF-8. The two things are completely independent of each other in most cases. – Mad Physicist Jan 08 '18 at 18:37
  • Also, be careful about setting the output file to the input file. You may end up truncating the input before you have a chance to read it. I would try it on a throwaway file first. – Mad Physicist Jan 08 '18 at 18:38
  • Look up the actual encoding used by your DB. Remember that the `file` program uses a heuristic to determine the encoding of a text file. It looked for non-ascii characters in the file and output the *most likely* encoding given the output found. I suspect that your `Â ` combinations are intended to be some weird tab or newline combinations, but I can't say without knowing the intended encoding for sure. – Mad Physicist Jan 08 '18 at 19:57
  • Another thing that would really help is if you could do a hex dump of the original snippet you posted. Seeing the binary values might help identify it as well – Mad Physicist Jan 08 '18 at 19:58
  • Final point, why did you do `sed -i -e 's/\Â//g'`? Does plain old `sed -i -e 's/Â//g'` not work? I would expect that the backslash is the only thing tripping you up there... – Mad Physicist Jan 08 '18 at 20:17

1 Answer


It looks like you have an encoding mismatch between the program that writes the file (in some variant of ISO-8859) and the program that reads it (which assumes UTF-8). This is a textbook use case for `iconv`. In fact, the sample in the man page applies almost exactly to your case:

iconv -f iso-8859-1 -t utf-8 test.txt

`iconv` ships with almost every Unix distribution I have seen, so it should already be available on your system.
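To make the mismatch concrete, here is a minimal sketch. It assumes the stray byte is 0xC2 (the Latin-1 code for Â), which is a guess based on the `sed` attempt in the question rather than something confirmed by a hex dump. The file name `sample.txt` and its contents are made up for illustration.

```shell
# Hypothetical sample: "12345", a lone 0xC2 byte (octal 302), a space, "210".
printf '12345\302 210\n' > sample.txt

# Raw bytes as stored: the 0xC2 stands alone, which is not valid UTF-8.
od -An -tx1 sample.txt

# After conversion, Latin-1 0xC2 becomes the two-byte UTF-8 sequence c3 82.
iconv -f iso-8859-1 -t utf-8 sample.txt | od -An -tx1
```

Note that the conversion does not remove the Â. It only re-encodes it so that a UTF-8 terminal or editor interprets it consistently. Whether the character belongs in the data at all is a separate question.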

Since you appear to be writing with English as your primary language, you are probably looking for `iso-8859-1` (Latin-1), the ISO-8859 variant most commonly used for Western European text.

If that does not fix your issue, you probably need to find the actual encoding of your database's output. You can run

iconv -l

to get a list of the encodings `iconv` supports, and use the one that works for you. Keep in mind that `file` reporting "ISO-8859 text" is not definitive. In many cases the bytes alone cannot tell you the encoding: pure ASCII text, for example, is also valid UTF-8 and valid ISO-8859-1. If I am not mistaken, `file` uses heuristics based on the byte values it finds in the file to guess the encoding, and it can easily guess wrong if the sample is small or ambiguous.
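Rather than relying on that guess, you can inspect the bytes directly. The sketch below assumes glibc's `iconv`, which exits with a non-zero status when it hits an invalid input sequence. `is_valid_utf8` and `high_bytes` are hypothetical helper names, and `sample.txt` is a made-up stand-in for your `test.txt`.

```shell
# Hypothetical helper: succeeds only if the whole file decodes as UTF-8.
is_valid_utf8() {
    iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: hex-dump only the non-ASCII bytes (0x80-0xFF).
high_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | od -An -tx1
}

# Made-up sample containing a lone 0xC2 byte (octal 302).
printf '12345\302 210\n' > sample.txt

if is_valid_utf8 sample.txt; then
    echo "sample.txt is valid UTF-8"
else
    echo "sample.txt is not valid UTF-8; non-ASCII bytes:"
    high_bytes sample.txt
fi
```

The list of non-ASCII bytes is usually short enough to compare against a code page table by hand, which helps in choosing the `-f` argument for `iconv`.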

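If you want to save the output of `iconv` and your version supports the `-o` flag, you can use it. Otherwise, use redirection, but carefully: redirecting straight onto the input file can truncate it before `iconv` has read it, so write to a temporary file first: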

TMP=$(mktemp)
iconv -f iso-8859-1 -t utf-8 test.txt > "$TMP" && mv "$TMP" test.txt
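If, after checking the bytes, the goal really is to drop a stray byte rather than re-encode the file, a byte-level tool sidesteps the locale. In a UTF-8 locale a lone 0xC2 is not a valid character, which may be why the `sed` attempts in the question did not match. This is only a sketch: it assumes the unwanted byte is 0xC2 (octal 302) and that the file is not UTF-8. On a real UTF-8 file it would corrupt two-byte sequences such as the non-breaking space (c2 a0). `strip_byte_c2` is a hypothetical helper name and `sample_strip.txt` is made-up data.

```shell
# Hypothetical helper: delete every 0xC2 byte from a file, via a temp file
# so the input is never truncated while it is still being read.
strip_byte_c2() {
    tmp=$(mktemp) || return 1
    if LC_ALL=C tr -d '\302' < "$1" > "$tmp"; then
        mv "$tmp" "$1"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Made-up sample with the same shape as the line in the question.
printf '12345\302 210\302 CBCDEM\n' > sample_strip.txt
strip_byte_c2 sample_strip.txt
cat sample_strip.txt
```

Try this on a throwaway copy first, as with the `iconv` redirection above.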
Mad Physicist
  • It keeps giving the error: iconv: conversion from `ISO-8859' is not supported – user2854333 Jan 08 '18 at 18:39
  • @user2854333. Try `iso8859`. What OS are you actually on? Please edit that into the question. – Mad Physicist Jan 08 '18 at 18:39
  • iso8859 doesn't work either. I ran `iconv -l` to get the supported encodings, and it seems iso-8859 is not among them; it has to be iso-8859-1 or something like that. – user2854333 Jan 08 '18 at 18:45
  • I tried it, but the characters still exist even though the encoding has changed. – user2854333 Jan 08 '18 at 18:55
  • @user2854333. I added an update that may help with that. My version of `iconv` does not allow for output files. Also, could you show the actual command that you use to output your text file? – Mad Physicist Jan 08 '18 at 19:52
  • Iso-8859 is not a valid encoding name. I assume the comments above were trying to say `iso-8859-1` – tripleee Jan 08 '18 at 20:41
  • @tripleee. Fixed, removed all mentions of iso-8859. – Mad Physicist Jan 08 '18 at 22:36
  • I'd still like to emphasize that the OP's question is almost certainly asking the wrong thing. You *don't* want to "replace Â"; this common symptom of an encoding problem should be fixed by correcting the encoding. Indeed, I already upvoted this answer yesterday, while it still had some minor technical issues. – tripleee Jan 09 '18 at 04:11
  • It didn't help unfortunately. I still see the characters in the file. – user2854333 Jan 09 '18 at 04:35
  • @user2854333. "It didn't help unfortunately" is not a very helpful comment. I've made numerous requests for additional information in the comments on the question itself which you've completely ignored. I can't really help you without that information, so I've added my close vote until you provide it. – Mad Physicist Jan 09 '18 at 12:58