For the past few days I've been trying to run a Zipf's Law experiment on a text file, using Cygwin on Windows 7. As soon as I fix one problem, another seems to crop up. For background on the other problems I had, please see my other question below:

sort: string comparison failed Invalid or incomplete multibyte or wide character

When I try to use the following sort command on my text file:

sort <m.txt | uniq -c | sort -nr >m.dict 

I get the following error:

 sort: string comparison failed: Invalid or incomplete multibyte or wide character
 sort: Set LC_ALL='C' to work around the problem.
 sort: The strings compared were ‘ogystal’ and ‘\342'i’.

I believe this is due to the \342 character (and a couple of others; I've also seen it show \357). As far as I know, \342 is an 'invisible' non-printable character.

I'm trying to use sed (simply by following an online tutorial, I've not used it before) to remove these characters with the commands:

sed 's/'`echo "\342"`'//g' m.txt

and

sed -e 's/'$(echo "\342")'//g' m.txt

However both of these commands give me the same error:

sed: -e expression #1, char 10: Invalid back reference

How can I use sed correctly to remove these troublesome non-printable characters?

hjalpmig
  • I'm not sure if sed is the right tool, but to solve the backref issue, you can quote like this: `sed -e 's/'"$(echo \"\342\")"'//g' m.txt` - and maybe it should be `echo -e`? http://stackoverflow.com/questions/602912/how-do-you-echo-a-4-digit-unicode-character-in-bash – Benjamin W. Apr 01 '16 at 21:32
  • Maybe a better way of removing all characters with the 8th bit set is `tr -d '\200-\377'`. – Jens Apr 01 '16 at 21:36
  • @Jens I was fiddling with tr earlier. It is possible, but it was echoing each line to the console in Cygwin (which takes a while with tens of thousands of lines). Can I use it without the echo? And would I use your command as `tr -d '\200-\377' m.txt`? – hjalpmig Apr 01 '16 at 21:42
  • @hjalpmig, you just need to redirect `tr` output to a file, e.g. `tr ... > m-sanitized.txt` – Piotr Findeisen Apr 01 '16 at 21:55
  • @Jens Thanks, using what you posted I've solved my problem! If you wish to make it into an answer, I will mark the question as solved once the 2-day limit has passed. – hjalpmig Apr 01 '16 at 22:09

3 Answers

A better way of removing all characters with the 8th bit set is

tr -d '\200-\377' < m.txt > m-no-8bit.txt
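
Note that tr reads only from standard input, so the file has to be supplied with a redirection rather than as an argument. As a small sketch of the effect, the sample string below is made up for illustration (it contains a UTF-8 "é" and a curly apostrophe), and LC_ALL=C is an added assumption so that tr treats the input as single bytes:

```shell
# Hypothetical sample text: \303\251 is "é" and \342\200\231 is a curly
# apostrophe, both encoded as UTF-8 multibyte sequences.
# LC_ALL=C (an assumption here) makes tr operate on individual bytes.
sample=$(printf 'caf\303\251 it\342\200\231s ok\n' | LC_ALL=C tr -d '\200-\377')

# Every byte in the range \200-\377 is gone, including the \342 that sort
# complained about.
printf '%s\n' "$sample"   # prints: caf its ok
```

For the real file, the same filter would read m.txt via `<` and write to a new file via `>`, which also avoids echoing every line to the console.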
Jens

The syntax using sed would be

LC_ALL=C sed 's/'$'\342''//g'

(using the bash $'...' quoting to interpret the character code before passing it to sed).
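
If the shell in use does not support $'...', a possible alternative is to let printf produce the byte, since printf interprets octal escapes portably while plain echo may not. The sketch below uses a made-up sample string and a hypothetical variable name `byte`. It also shows that deleting only \342 leaves the rest of a multibyte sequence behind:

```shell
# Build the single byte \342 with printf ("byte" is a hypothetical name).
byte=$(printf '\342')

# Hypothetical sample: \342\200\231 is a UTF-8 curly apostrophe.
# LC_ALL=C makes sed treat the input as bytes rather than characters.
left=$(printf 'it\342\200\231s\n' | LC_ALL=C sed "s/$byte//g")

# Dump the result byte by byte: \342 is gone, but the two continuation
# bytes \200 and \231 are still present.
printf '%s' "$left" | od -An -c
```

Because the leftover bytes may still upset sort, each remaining byte would need its own substitution, or the whole range can be deleted in one pass with tr.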

Guido

This might work for you (GNU sed):

sed 's/\o342//g' file

To remove the octal value 342 use \o342.
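
A quick way to try this on a made-up input is sketched below. The \o escape is a GNU sed extension, and the LC_ALL=C prefix is an added assumption here, so that sed matches the lone byte instead of trying to decode it as part of a multibyte character:

```shell
# Hypothetical input: the letters "a" and "b" with a stray \342 byte between.
# \o342 is GNU sed syntax; LC_ALL=C (an assumption) forces byte-wise matching.
out=$(printf 'a\342b\n' | LC_ALL=C sed 's/\o342//g')

printf '%s\n' "$out"   # prints: ab
```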

potong