I'm trying to use the following command on a text file:

$ sort <m.txt | uniq -c | sort -nr >m.dict 

However, I get the following error message:

sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were ‘enwedig\r’ and ‘mwy\r’.

I'm using Cygwin on Windows 7 and was having trouble earlier editing m.txt to put each word within the file on a new line. Please see:

Using AWK to place each word in a text file on a new line

I'm not sure whether I'm getting these errors because of this, or because m.txt contains characters from the Welsh alphabet (when I was working with Welsh text in Python, I was required to change the encoding to 'Latin-1').

I tried following the error message's advice and setting LC_ALL='C', but this has not helped. Can anyone elaborate on the errors I'm receiving and advise how I might go about solving this problem?

UPDATE:

When I tried dos2unix, it reported invalid characters at certain lines. These turned out not to be Welsh characters but other stray characters (arrows, etc.). I went through the text file removing them until dos2unix ran without error. However, after running dos2unix all the text appeared concatenated (no spaces, newlines, or anything), whereas each word in the file should have been on a separate line. I then ran unix2dos and the text file was back to normal. How can I get each word on its own line and use the sort command without it giving me errors about '\r' characters?

hjalpmig
  • `dos2unix` doesn't lead to one long line; it's only the Windows tools that don't understand Unix line endings. Don't use a Windows editor to look at a Unix file, use a Unix editor such as `vi` and you'll see each word on one line. And make sure you use the cygwin sort program, not the Windows sort program. Use `/usr/bin/sort` to be sure. – Jens Apr 01 '16 at 21:25
  • Ah I see. My problem is still not quite solved but I think now it has drifted too far from the original question so I've created another. I will close this question now, thanks for the help. – hjalpmig Apr 01 '16 at 21:29

2 Answers

I know it's an old question, but just running the command export LC_ALL='C' before the pipeline does the trick, as the error message itself suggests (sort: Set LC_ALL='C' to work around the problem.).
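
A minimal sketch of how this workaround could be applied, assuming a POSIX shell. The names `sample.txt` and `sample.dict` are hypothetical stand-ins for m.txt and m.dict. A bare `LC_ALL='C'` assignment without `export` (or without prefixing the command) is normally not passed on to `sort`, which may be why the attempt in the question appeared to have no effect.

```shell
# Hypothetical sample file with Windows (CRLF) line endings, standing in for m.txt
printf 'mwy\r\nenwedig\r\nmwy\r\n' > sample.txt

# Prefix each sort with LC_ALL=C so strings are compared byte by byte
# and no multibyte decoding is attempted
LC_ALL=C sort < sample.txt | uniq -c | LC_ALL=C sort -nr > sample.dict

# Alternatively, export once so every later command in this shell sees it
export LC_ALL=C
sort < sample.txt | uniq -c | sort -nr > sample.dict
```

The byte-wise comparison sidesteps the encoding error, but the trailing '\r' characters remain in the output until the line endings are converted.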

Philip Rollins

Looks like a Windows line-ending related problem (\r\n versus \n). You can convert m.txt to Unix line-endings with

dos2unix m.txt

and then rerun your command.
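
If dos2unix is unavailable or refuses the file, a rough alternative is to delete the carriage returns with `tr`. This is only a sketch: `sample_crlf.txt` and `sample_lf.txt` are hypothetical file names, and also deleting the 0x1A byte (octal 032) is an assumption that such bytes are stray and safe to drop.

```shell
# Hypothetical input: CRLF line endings plus a stray 0x1A (Ctrl-Z) byte at the end
printf 'mwy\r\nenwedig\r\nmwy\r\n\032' > sample_crlf.txt

# Delete every carriage return and every 0x1A byte, writing a new file
tr -d '\r\032' < sample_crlf.txt > sample_lf.txt

# Run the original pipeline on the cleaned copy, comparing bytes only
LC_ALL=C sort < sample_lf.txt | uniq -c | LC_ALL=C sort -nr
```

Writing to a new file keeps the original untouched, so the result can be checked in a Unix editor before replacing anything.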

Jens
  • Hi, this gives the message "dos2unix: Binary symbol 0x1A found at line 11451024 dos2unix: Skipping binary file m.txt" and then when I try the original command I get the same error. Any ideas? – hjalpmig Mar 30 '16 at 17:03
  • @hjalpmig Do you know the *encoding* of the file? I.e. is it UTF-8, Windows code page X, some other encoding? How was this file created? Does it look fine when opened with a Windows editor? – Jens Mar 30 '16 at 20:51
  • It looks fine when opened in a text editor (Notepad). I'm not entirely sure of the encoding, but it contains Welsh language characters such as: â, ê, î, ô, û, ŵ, ŷ. I also tried dos2unix with the -f option and it runs, but then when I try the sort it's the same error. – hjalpmig Mar 31 '16 at 12:38
  • You can check whether any of the UTF-8 locales works. List the available locales with `locale -a`, then use e.g. `export LC_ALL=en_US.UTF-8`. Verify the setting with `locale`, then run the pipe again. If you suspect the encoding is some ISO8859, do the same with an appropriate locale. – Jens Mar 31 '16 at 14:05
  • I believe Welsh would be part of 'ISO/IEC 8859-14'. How can I change the locale to that? It doesn't show when listing locales with 'locale -a'. – hjalpmig Mar 31 '16 at 19:35
  • If `locale -a` doesn't show it, then the C library does not support that locale. In that case, maybe the `iconv` codeset converter can convert it to a usable encoding. Failing that, it's time to think outside the box: delete the Welsh lines; create the file with a usable encoding; do it with the Windows tools,... – Jens Mar 31 '16 at 20:53
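
For the `iconv` route mentioned in the last comment, a rough sketch follows. It assumes the input is ISO-8859-1 (Latin-1), which is only a guess based on the Python remark in the question; the real encoding of m.txt is unknown. The names `latin1.txt` and `utf8.txt` are hypothetical, and `iconv -l` lists the encoding names a given system supports.

```shell
# Hypothetical Latin-1 input: "mâr" with â stored as the single byte 0xE2 (octal 342)
printf 'm\342r\n' > latin1.txt

# Convert from the assumed source encoding to UTF-8
iconv -f ISO-8859-1 -t UTF-8 < latin1.txt > utf8.txt

# The converted file can then go through the usual pipeline
LC_ALL=C sort < utf8.txt | uniq -c | LC_ALL=C sort -nr
```

With `LC_ALL=C` the words are ordered by byte value rather than alphabetically, which is enough for counting duplicates with `uniq -c`.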