6

Here is a screenshot of an issue I'm having with sort:

https://i.stack.imgur.com/QafQy.png

My goal is to put all equal strings on consecutive lines. It works for 99% of the list I'm sorting, but there are a few hitches, such as those in the screenshot.

So all the yahoo.com lines should be next to each other, then all the Yahoo.com lines, then YAHOO.com, yahoo.cmo, yhoo.c, etc., with each typo getting its own group of lines.

Not entirely sure how to handle this with sort, but I'm certainly trying.

I print all the domains unsorted to a file and then sort it with just a vanilla `sort filename`.

Would love some advice/input.

Wuzseen

2 Answers

14

You probably need to override the locale. Most Linux systems default to a UTF-8 locale, whose collation rules largely ignore both case and punctuation when comparing lines.

LANG=C sort filename
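
One caveat worth sketching: `LANG=C` only takes effect if neither `LC_ALL` nor `LC_COLLATE` is already set in the environment, since both take precedence over `LANG`. A slightly more defensive variant, shown here on hypothetical sample data rather than the real domain list, sets `LC_ALL` for the one command:

```shell
# Hypothetical sample data standing in for the real domain list.
tmp=$(mktemp)
printf '%s\n' yahoo.com Yahoo.com YAHOO.com yahoo.com Yahoo.com yahoo.cmo > "$tmp"

# LC_ALL overrides LANG and LC_COLLATE, so this forces plain byte-order
# comparison for this single sort invocation only.
sorted=$(LC_ALL=C sort "$tmp")
printf '%s\n' "$sorted"

# Optional follow-up: once identical lines are adjacent, uniq -c can count
# how often each distinct spelling occurs.
LC_ALL=C sort "$tmp" | uniq -c

rm -f "$tmp"
```

In byte order the uppercase letters sort before the lowercase ones, so the `YAHOO.com` and `Yahoo.com` groups come out ahead of the `yahoo.*` groups, and every distinct spelling ends up on consecutive lines.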
geekosaur
  • Would it be wise to return LANG to UTF-8 afterwards? Or whatever it was before the sort... – Wuzseen Apr 26 '12 at 03:48
  • That usage only changes `LANG` for that one command; the global value remains unchanged. – geekosaur Apr 26 '12 at 03:51
  • Wicked, hopefully this fixes it. It's got to run through quite a list of files at first. Thanks for the help. – Wuzseen Apr 26 '12 at 03:52
  • @geekosaur: I've run into this problem before, but I never really understood why the UTF-8 collating sequence would be different when sorting ASCII data. Couldn't that be considered a bug? – Barton Chittenden Apr 26 '12 at 04:31
  • @BartonChittenden, it's considered a feature apparently, and has been for years. I do not pretend to understand why this was chosen as the default behavior. – geekosaur Apr 26 '12 at 04:35
  • Actually, it's explained here: http://stackoverflow.com/questions/4493175/bash-sort-unusual-order-problem-with-spaces ... apparently UTF-8 (and perhaps other locales) ignore spaces while sorting. – Barton Chittenden Apr 26 '12 at 04:38
1

Normalize your input a bit before sorting, for example by folding everything to lowercase:

tr 'A-Z' 'a-z' < filename | sort

Try reading "Unix for Poets".
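
A minimal sketch of that normalization, using POSIX character classes and hypothetical sample data. It assumes that folding case is acceptable: doing so merges Yahoo.com, YAHOO.com and yahoo.com into a single group, which is not the same as the per-spelling grouping asked for in the question.

```shell
# Hypothetical sample data standing in for the real domain list.
tmp=$(mktemp)
printf '%s\n' Yahoo.com yahoo.com YAHOO.com yhoo.c > "$tmp"

# Fold to lowercase, then sort in byte order so equal lines are adjacent.
# The quotes keep the shell from treating the brackets as a glob pattern.
folded=$(tr '[:upper:]' '[:lower:]' < "$tmp" | LC_ALL=C sort)
printf '%s\n' "$folded"

# Count each normalized spelling.
printf '%s\n' "$folded" | uniq -c

rm -f "$tmp"
```

Typos such as yhoo.c still form their own group, because only letter case is normalized.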

abecadel