Skip/remove non-ascii character with sed

Question

Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa

I've been trying to use sed to modify email addresses in a .csv file, but the line above keeps tripping me up. Using a command like:

sed -i 's/[\d128-\d255]//' FILENAME

from this Stack Overflow question

doesn't work; I get an 'invalid collation character' error.

Ideally I don't want to change that combined AE character (æ) at all. I'd rather sed skipped right over it, since I'm not trying to manipulate that text, only the email addresses. As long as that character is in there, though, my sed substitution fails after one line; if I delete the character, sed processes the whole file fine.

Any ideas?

score 6 · Accepted Answer · answered Dec 20 '11 at 10:52

This might work for you (GNU sed):

echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" |
sed 's/\o346/a+e/g'
Chip,Dirkland,Droba+eSphere Inc,cdirkland@hotmail.com,usa

Then do whatever processing you need and, afterwards, revert the placeholder:

echo "Chip,Dirkland,Droba+eSphere Inc,cdirkland@hotmail.com,usa" | 
sed 's/a+e/\o346/g'
Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa

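If you have tricky characters in strings and want to understand how sed sees them, use the l0 command (see here): it prints the pattern space unambiguously, showing non-printable bytes as octal escapes, and the 0 turns off line wrapping. It is also very useful for debugging difficult regexps.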

echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" | 
sed -n 'l0'
Chip,Dirkland,Drob\346Sphere Inc,cdirkland@hotmail.com,usa$
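A caveat worth checking with l0 before relying on \o346: that escape is the single Latin-1 byte for æ. If the file is UTF-8, the same character is stored as two bytes (\303\246) and the placeholder expression has to name both. A minimal sketch, assuming GNU sed and building the sample line with printf so the bytes are known exactly:

```shell
# Sample line with "æ" encoded as UTF-8 (bytes 0xC3 0xA6).
line=$(printf 'Chip,Dirkland,Drob\303\246Sphere Inc,cdirkland@hotmail.com,usa')

# In the C locale sed works byte by byte; l0 shows each non-ASCII byte in octal.
dump=$(printf '%s\n' "$line" | LC_ALL=C sed -n 'l0')
printf '%s\n' "$dump"
# Chip,Dirkland,Drob\303\246Sphere Inc,cdirkland@hotmail.com,usa$

# Use the bytes that l0 reported when building the placeholder expression.
placeholder=$(printf '%s\n' "$line" | LC_ALL=C sed 's/\o303\o246/a+e/g')
printf '%s\n' "$placeholder"
```

If l0 reports some other sequence (for example \357\277\275, as in the comment below), substitute those bytes instead.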

answered Dec 20 '11 at 10:52

potong

+1 for the `l0`. There is another `sedsed.py` script too, available [here](http://aurelio.net/sedsed/). Useful to inspect `pattern` and `hold` spaces. Might not help in this case but a useful debugging tool none the less. :) – jaypal singh Dec 20 '11 at 18:32
that sed -n 'l0' command is interesting, what it prints out for company is: Drob\357\277\275Sphere Inc – xref Dec 21 '11 at 03:10
and I still can't get the examples above to work with it, perhaps the character (which shows as an AE in Windows LibreOffice but nowhere else) is actually a special character saying it can't be represented in unicode? http://www.fileformat.info/info/unicode/char/fffd/index.htm – xref Dec 21 '11 at 03:28
I never did get any of the answers on this page to work perfectly, but potong's solution got me the closest and the command provided some more exact detail on what was going wrong – xref Jul 23 '12 at 16:40
Does not help to remove all non-ASCII characters. Only helps to remove specific one given in example. – Jason C Jun 18 '14 at 15:15

jcalfee314 · Answer 2 · 2012-01-17T18:59:30.643

5

sed -i 's/[^[:print:]]//' FILENAME

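This also acts like dos2unix, since the carriage return is not a printable character. (Add the g flag to remove every occurrence on a line rather than only the first.)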
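Whether [:print:] covers non-ASCII characters depends on the locale. A small sketch, assuming GNU sed: forcing the C locale makes every byte above 0x7F count as non-printable, so the same expression then strips non-ASCII bytes and the carriage return alike. The sample string is made up for illustration.

```shell
# Sample with a UTF-8 "é" (0xC3 0xA9) and a DOS line ending (CR before LF).
out=$(printf 'caf\303\251,menu\r\n' | LC_ALL=C sed 's/[^[:print:]]//g')
printf '%s\n' "$out"
# caf,menu
```

Note that this also removes tabs and other control characters, which may or may not be wanted in a CSV file.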

edited Jan 17 '12 at 18:59

answered Jan 17 '12 at 18:48

jcalfee314

Does not work. [:print:] is not the same as ASCII, e.g. `ü` is printable but not ASCII. – Jason C Jun 18 '14 at 15:15

score 3 · Answer 3 · answered Sep 11 '20 at 03:50

The issue you are having is the locale.

If you want to use a byte range like that, you need to change both the character type (LC_CTYPE) and the collation order (LC_COLLATE).

The range fails because the bytes \x80 through \xff are not valid characters on their own in a UTF-8 string. Note that \u0080 != \x80 in UTF-8.

To get this to work, just run:

LC_ALL=C sed -i 's/[\d128-\d255]//' FILENAME

This overrides LC_CTYPE and LC_COLLATE for that one command and does what you want.
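As a quick check of the locale override, here is a sketch that works on a pipe instead of a file. It assumes GNU sed, uses hex escapes for the same byte range, and adds the g flag so that every matching byte on the line is removed, not just the first:

```shell
# Sample line with a UTF-8 "æ" (bytes 0xC3 0xA6).
line=$(printf 'Chip,Dirkland,Drob\303\246Sphere Inc,cdirkland@hotmail.com,usa')

# With LC_ALL=C the range is a plain byte range, so no collation error occurs.
clean=$(printf '%s\n' "$line" | LC_ALL=C sed 's/[\x80-\xFF]//g')
printf '%s\n' "$clean"
# Chip,Dirkland,DrobSphere Inc,cdirkland@hotmail.com,usa
```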

mxmlnkn · Answer 4 · 2021-07-02T19:47:48.517

I came here trying this sed command s/[\x00-\x1F]/ /g;, which gave me the same error message.

In this case it suffices to remove \x00 from the range, yielding s/[\x01-\x1F]/ /g;

Unfortunately it seems like all characters above and including \x7F and some others are disallowed, as can be seen with this short script:

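for (( i=0; i<255; i++ )); do
    # Print the byte value in decimal and hex (no newline: sed's output or error follows).
    printf '== %d - \\x%02X ==' "$i" "$i"
    # Try a two-byte range starting at this byte; sed reports the ranges it rejects.
    echo '' | sed -E "s/[\d$i-\d$((i+1))]//g"
done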

Note that the problem is only the use of those characters to specify a range. You can still list them all, either by hand or with a script. E.g. to come back to your example:

sed -i 's/[\d128-\d255]//' FILENAME

would become

c=; for (( i=128; i<=255; i++ )); do c="$c\d$i"; done
sed -i 's/['"$c"']//' FILENAME

which would translate to:

sed -i 's/[\d128\d129\d130\d131\d132\d133\d134\d135\d136\d137\d138\d139\d140\d141\d142\d143\d144\d145\d146\d147\d148\d149\d150\d151\d152\d153\d154\d155\d156\d157\d158\d159\d160\d161\d162\d163\d164\d165\d166\d167\d168\d169\d170\d171\d172\d173\d174\d175\d176\d177\d178\d179\d180\d181\d182\d183\d184\d185\d186\d187\d188\d189\d190\d191\d192\d193\d194\d195\d196\d197\d198\d199\d200\d201\d202\d203\d204\d205\d206\d207\d208\d209\d210\d211\d212\d213\d214\d215\d216\d217\d218\d219\d220\d221\d222\d223\d224\d225\d226\d227\d228\d229\d230\d231\d232\d233\d234\d235\d236\d237\d238\d239\d240\d241\d242\d243\d244\d245\d246\d247\d248\d249\d250\d251\d252\d253\d254\d255]//' FILENAME

"_Unfortunately it seems like all characters above and including \x7F and some others are disallowed_". Thanks! That explained why I'm getting the `Invalid collation character` error. — xpt, Jul 22 '17 at 15:01
Very helpful to identify that `\u0000` can't be used as part of a range as well. — MobileVet, Jul 02 '21 at 17:20

score 1 · Answer 5 · edited May 23 '17 at 11:53

In this case there is a way to simply skip over non-ASCII characters without removing them: run sed in the C locale, where every byte is treated as a single character.

LANG=C sed /someemailpattern/

See https://bugzilla.redhat.com/show_bug.cgi?id=440419 and Will sed (and others) corrupt non-ASCII files?.
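A minimal sketch of what that looks like for the line in the question. The substitution itself (rewriting the mail domain to example.com) is a made-up stand-in for whatever edit is actually needed, and LC_ALL is used rather than LANG because LC_ALL takes precedence if it is already set in the environment:

```shell
# Sample line with a UTF-8 "æ" (bytes 0xC3 0xA6).
line=$(printf 'Chip,Dirkland,Drob\303\246Sphere Inc,cdirkland@hotmail.com,usa')

# Hypothetical edit: only the email domain is touched; the æ bytes pass through untouched.
out=$(printf '%s\n' "$line" | LC_ALL=C sed 's/@hotmail\.com/@example.com/')
printf '%s\n' "$out"
```

The company name keeps its original bytes, so the file's encoding is unchanged apart from the edited field.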

answered Apr 03 '12 at 15:00

Vadzim

jaypal singh · Answer 6 · 2011-12-20T09:05:47.173

0

How about using awk for this? We set the field separator to the empty string, so that each character becomes its own field, then loop over the characters. An if statement checks whether each one matches our character class: if it does we print it, otherwise we ignore it.

awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf "%s", $i}'

Test:

[jaypal:~/Temp] echo "Chip,Dirkland,DrobæSphere Inc,cdirkland@hotmail.com,usa" | 
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf "%s", $i}'
Chip,Dirkland,DrobSphere Inc,cdirkland@hotmail.com,usa

Update:

awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf "%s", $i; printf "\n"}' < datafile.csv > asciidata.csv

I have added printf "\n" after the loop to keep the lines separate.
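The whitelist above also drops digits and any punctuation not listed in the class. A possible variation, offered as a sketch and not as part of the original answer: run awk in the C locale and delete every byte outside the printable ASCII range (space through ~) with gsub, which avoids the per-character loop.

```shell
# Sample line with a UTF-8 "æ" (bytes 0xC3 0xA6).
line=$(printf 'Chip,Dirkland,Drob\303\246Sphere Inc,cdirkland@hotmail.com,usa')

# In the C locale the range " -~" covers exactly the bytes 0x20 through 0x7E.
ascii=$(printf '%s\n' "$line" | LC_ALL=C awk '{ gsub(/[^ -~]/, ""); print }')
printf '%s\n' "$ascii"
# Chip,Dirkland,DrobSphere Inc,cdirkland@hotmail.com,usa
```

For a whole file the same program can be run as in the update above, reading datafile.csv and writing asciidata.csv.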

edited Dec 20 '11 at 09:05

answered Dec 20 '11 at 07:47

jaypal singh

Thanks Jaypal, how would this be modified if you wanted to process datafile.csv and output asciidata.csv? – xref Dec 20 '11 at 08:54
If you only want e-mail address extracted from your input file then `awk` can do that in a breeze without any complex `regex`. Let me know how it works out. – jaypal singh Dec 20 '11 at 17:18
