For the past few days I've been trying to run a Zipf's Law experiment on a text file, using Cygwin on Windows 7. As soon as I fix one problem, another one seems to crop up. If you want background on the other problems I had, please see my other question below:
sort: string comparison failed Invalid or incomplete multibyte or wide character
When I try to use the following sort command on my text file:
sort <m.txt | uniq -c | sort -nr >m.dict
I get the following error:
sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were ‘ogystal’ and ‘\342'i’.
I believe this is due to the \342 character (and a couple of others; I've also seen it show \357). As far as I know, \342 is an 'invisible', non-printable character.
I'm trying to use sed (simply by following an online tutorial, I've not used it before) to remove these characters with the commands:
sed 's/'`echo "\342"`'//g' m.txt
and
sed -e 's/'$(echo "\342")'//g' m.txt
However, both of these commands give me the same error:
sed: -e expression #1, char 10: Invalid back reference
How can I use sed correctly to remove these troublesome non-printable characters?
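For reference, here is a minimal sketch of one possible workaround, not a definitive fix. \342 and \357 are octal for the bytes 0xE2 and 0xEF, which typically begin multibyte UTF-8 sequences (a curly quote or a byte order mark, for example) rather than being single characters. The sed attempts above most likely fail because echo prints the literal text \342, so sed reads \3 as a back reference to a group that does not exist. The sketch uses tr instead of sed to delete every byte outside the ASCII range, then sorts byte-wise as the error message suggests. The file names sample.txt, sample.clean.txt and sample.dict and the sample words are assumptions for illustration only.

```shell
# Hypothetical sample input, one word per line. Line 2 starts with the UTF-8
# bytes \342\200\231 (a right single quotation mark); line 4 starts with
# \357\273\277 (a byte order mark).
printf 'ogystal\n\342\200\231i\nogystal\n\357\273\277the\nthe\n' > sample.txt

# Delete every byte in the range 0x80-0xFF. LC_ALL=C makes tr treat the input
# as raw bytes instead of trying to decode multibyte characters.
LC_ALL=C tr -d '\200-\377' < sample.txt > sample.clean.txt

# Same pipeline as in the question, with byte-wise comparison in both sorts.
LC_ALL=C sort < sample.clean.txt | uniq -c | LC_ALL=C sort -nr > sample.dict
```

Note that this also strips legitimate non-ASCII letters (accented characters, for instance), so it only suits text where everything outside ASCII is unwanted. If the characters should be kept, running the original pipeline with LC_ALL=C on both sort commands and skipping the tr step may be enough.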