
I have a CSV file that looks like this:

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st.                                        110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56

I need to sort it by line length, including spaces. The following command doesn't include spaces; is there a way to modify it so it will work for me?

cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'
codeforester
gnarbarian
    I'd really like to live in Binary Avenue or Ternary Street, those people certainly would agree with things like "8192 *is* a round number" – schnaader May 06 '11 at 22:20

13 Answers


Answer

cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-

Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:

cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-

In both cases, we have solved your stated problem by moving away from awk for your final cut.

Lines of matching length - what to do in the case of a tie:

The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.

(Those who want more control of sorting these ties might look at sort's --key option.)
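To see the difference concretely (the two sample lines here are invented for the demo): with -s, equal-length lines keep their input order; without it, sort breaks the tie with a last-resort comparison of the whole line.

```shell
# Two equal-length lines, deliberately in reverse alphabetical order:
printf 'zz 1\naa 2\n' | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-
# stable sort: "zz 1" stays first

printf 'zz 1\naa 2\n' | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-
# last-resort full-line comparison: "aa 2" jumps ahead
```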

Why the question's attempted solution fails (awk line-rebuilding):

It is interesting to note the difference between:

echo "hello   awk   world" | awk '{print}'
echo "hello   awk   world" | awk '{$1="hello"; print}'

They yield respectively

hello   awk   world
hello awk world

The relevant section of gawk's manual mentions only as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc.) when you change one field. It's not unreasonable behaviour. The manual has this:

"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"

 $1 = $1   # force record to be reconstituted
 print $0  # or whatever else with $0

"This forces awk to rebuild the record."
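The effect of that seemingly innocuous assignment is easy to see: assigning any field forces $0 to be reconstituted with OFS between fields, which is why runs of spaces collapse (and why changing OFS changes what they collapse to):

```shell
echo "hello   awk   world" | awk '{ $1 = $1; print }'
# hello awk world

echo "hello   awk   world" | awk 'BEGIN { OFS = "-" } { $1 = $1; print }'
# hello-awk-world
```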

Test input including some lines of equal length:

aa A line   with     MORE    spaces
bb The very longest line in the file
ccb
9   dd equal len.  Orig pos = 1
500 dd equal len.  Orig pos = 2
ccz
cca
ee A line with  some       spaces
1   dd equal len.  Orig pos = 3
ff
5   dd equal len.  Orig pos = 4
g
neillb
    heemayl, yes it is, thanks. I've tried to match the shape of OP's attempted solution where possible, to enable him to focus on only important differences between his and mine. – neillb Jan 07 '17 at 21:27
    It's worth pointing out that `cat $@` is broken, too. You absolutely definitely want to quote it, like `cat "$@"` – tripleee Jul 18 '17 at 07:30
    awk is probably ubiquitous and simplest, but the Python equivalent on *nix systems is ```python -c "for line in open('/dev/stdin'): print(len(line), line, end='')"``` :-) – Terry Brown Jan 06 '22 at 14:43
    @TerryBrown they didn't want the line lengths printed in the output, just the original lines, but sorted. These two are cross-platform: `python -c "import sys; [print(x) for x in sorted(sys.stdin.read().splitlines(), key=len)]"` or `python -c "import sys;[sys.stdout.write(x) for x in sorted(sys.stdin, key=len)]"`. The first is robust against mixed newline styles, while the second requires less processing and preserves newline style, but might give confusing results on a file with mixed newline styles. – Mike Clark Jul 08 '22 at 00:22

The AWK solution from neillb is great if you really want to use awk, and it explains why it's a hassle there, but if what you want is to get the job done quickly and you don't care what language it's done in, one solution is to use Perl's sort() function with a custom comparison routine to iterate over the input lines. Here is a one-liner:

perl -e 'print sort { length($a) <=> length($b) } <>'

You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect) or by giving the filename to perl as another argument and letting it open the file.

In my case I needed the longest lines first, so I swapped out $a and $b in the comparison.

Caleb
    This is a better solution because awk causes unexpected sorting when the input file contains numeric and alphanumeric lines. Here is the one-line command: $ cat testfile | perl -e 'print sort { length($a) <=> length($b) } <>' – alemol Nov 06 '18 at 23:17
    Fast! Did 465,000 line file (one word per line) in <1 second, when output redirected into another file - thus: `cat testfile.txt | perl -e 'print sort { length($a) <=> length($b) } <>' > out.txt` – cssyphus May 12 '20 at 18:44
    Windows with StrawberryPerl works: `type testfile.txt | perl -e "print sort { length($a) <=> length($b) } <>" > out.txt` – bryc Jun 20 '20 at 15:29
  • This is one of the hidden jewels of computer languages. Thanks so much!! – riccs_0x Jul 17 '22 at 08:41

Benchmark results

Below are the results of a benchmark across solutions from other answers to this question.

Test method

  • 10 sequential runs on a fast machine, averaged
  • Perl 5.24
  • awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
  • The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)

Results

  1. Caleb's perl solution took 11.2 seconds
  2. my perl solution took 11.6 seconds
  3. neillb's awk solution #1 took 20 seconds
  4. neillb's awk solution #2 took 23 seconds
  5. anubhava's awk solution took 24 seconds
  6. Jonathan's awk solution took 25 seconds
  7. Fritz's bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.

Another perl solution

perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
Chris Koknat

Try this command instead:

awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-
Zombo
anubhava

Pure Bash:

declare -a sorted

while IFS= read -r line; do                       # -r: don't mangle backslashes
  if [ -z "${sorted[${#line}]}" ] ; then          # does line length already exist?
    sorted[${#line}]="$line"                      # element for new length
  else
    sorted[${#line}]="${sorted[${#line}]}\n$line" # append to lines with equal length
  fi
done < data.csv

for key in ${!sorted[*]}; do                      # iterate over existing indices
  echo -e "${sorted[$key]}"                       # echo lines with equal length
done
Fritz G. Mehner

Python Solution

Here's a Python one-liner that does the same, tested with Python 3.9.10 and 2.7.18. It's about 60% faster than Caleb's perl solution, and the output is identical (tested with a 300MiB wordlist file with 14.8 million lines).

python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'

Benchmark:

python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
real    0m5.308s
user    0m3.733s
sys     0m1.490s

perl -e 'print sort { length($a) <=> length($b) } <>'
real    0m8.840s
user    0m7.117s
sys     0m2.279s
ThomasH

The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding the UUOC, Useless Use of Cat).

awk '{ printf "%d:%s\n", length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]*://'

The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:

awk '{ print length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]* //'
Zombo
Jonathan Leffler

I found that these solutions will not work if your file contains lines that start with a number, since those lines will be sorted numerically along with the prepended length counts. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):

awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-
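A quick way to try this yourself (sample lines invented for the demo; as the comments below note, whether plain -n actually misbehaves depends on your sort implementation and locale, while -g parses the leading number with strtod and stops at the first non-numeric character):

```shell
# A line that itself starts with a number, next to a short line:
printf '100 a\n7\n' | awk '{ print length, $0 }' | sort -g | cut -d" " -f2-
# 7
# 100 a
```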
Markus Amalthea Magnuson
    Hi, Markus. I don't observe line content (numeric or not) - as opposed to line-length - as having any effect on sorting except in the case of lines with matching lengths. Is this what you meant? In such cases, I did not find switching sort methods from `-n` to your suggested `-g` to yield any improvement, so I expect not. I have now addressed, in my answer, how to prohibit sub-sorting of equal-length lines (using `--stable`). Whether or not that was what you meant, thanks for bringing it to my attention! I've also added a considered input to test with. – neillb Jun 18 '15 at 02:08
    No, let me explain by breaking it down. Just the `awk` part will generate a list of lines prefixed with line length and a space. Piping it to `sort -n` will work as expected. But if any of those lines already has a number at the beginning, those lines will start with length + space + number. `sort -n` disregards that space and will treat it as one number concatenated from length + number. Using the `-g` flag will instead stop at the first space, yielding a correct sort. Try it yourself by creating a file with some number-prefixed lines and run the command step by step. – Markus Amalthea Magnuson Nov 25 '16 at 10:13
    I also found that `sort -n` disregards the space and produces an incorrect sorting. `sort -g` outputs the correct order. – r_31415 Dec 30 '16 at 23:16
    I cannot reproduce the described issue with `-n` in `sort (GNU coreutils) 8.21`. The `info` documentation describes `-g` as less efficient and potentially less-precise (it converts numbers to floats), so probably don't use it if you don't need to. – phils May 15 '19 at 08:31
    n.b. documentation for `-n`: "Sort numerically. The number begins each line and consists of optional blanks, an optional ‘-’ sign, and zero or more digits possibly separated by thousands separators, optionally followed by a decimal-point character and zero or more digits. An empty number is treated as ‘0’. The ‘LC_NUMERIC’ locale specifies the decimal-point character and thousands separator. By default a blank is a space or a tab, but the ‘LC_CTYPE’ locale can change this." – phils May 15 '19 at 08:37
    Perhaps try `LC_ALL=C sort -n` – phils May 15 '19 at 08:40

With POSIX Awk:

{
  c = length
  m[c] = m[c] ? m[c] RS $0 : $0
} END {
  for (c in m) print m[c]
}

Example
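Assuming the program above is saved as bylen.awk (a filename chosen here for illustration), a minimal run looks like this. Note that POSIX leaves the iteration order of for (c in m) unspecified, so lines are grouped by length, but the groups themselves are not guaranteed to come out in ascending order on every awk:

```shell
# Save the program to a file, then feed it some sample lines:
cat > bylen.awk <<'EOF'
{
  c = length
  m[c] = m[c] ? m[c] RS $0 : $0
} END {
  for (c in m) print m[c]
}
EOF
printf 'bb\na\nccc\n' | awk -f bylen.awk
```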

Zombo

1) A pure awk solution. Let's suppose that line length cannot be more than 1024; then

cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'

2) A one-liner bash solution, assuming all lines have just one word, but it can be reworked for any case where all lines have the same number of words:

LINES=$(cat filename); for k in $LINES; do printf "%s " "$k"; echo $k | wc -L; done | sort -k2 -n | head -n 1 | cut -d " " -f1


Using Raku (formerly known as Perl 6):

~$ cat "BinaryAve.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};'

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st.                                        110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56

To reverse the sort, add .reverse in the middle of the chain of method calls, immediately after .sort(). Here's code showing that .chars includes spaces:

~$ cat "number_triangle.txt" | raku -e 'given lines() {.map(*.chars).say};'
(1 3 5 7 9 11 13 15 17 19 0)
~$ cat "number_triangle.txt"
1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6 7
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 0

Here's a time comparison between awk and Raku using a 9.1MB txt file from Genbank:

~$ time cat "rat_whole_genome.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};' > /dev/null
    
    real    0m1.308s
    user    0m1.213s
    sys 0m0.173s
    
~$ #awk code from neillb
~$ time cat "rat_whole_genome.txt" | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-  > /dev/null
    
    real    0m1.189s
    user    0m1.170s
    sys 0m0.050s

HTH.

https://raku.org

jubilatious1
    My solution `.say for lines.sort({ $^a.chars <=> $^b.chars })` (inspired by the Perl answer) is a bit faster than yours. I don't know why. – Julia Jan 24 '23 at 17:55
    Thank you! I think I was on an older version (Rakudo 2019?) at the time, or working on an older laptop. Note I'd probably write the solution today as: `raku -e 'put lines.sort(*.chars).join("\n");'`, taking care that `put` doesn't truncate anything as compared to `say` (potentially). Best Regards. – jubilatious1 Jan 25 '23 at 18:46

Here is a multibyte-compatible method of sorting lines by length. It requires:

  1. wc -m is available to you (macOS has it).
  2. Your current locale supports multi-byte characters, e.g., by setting LC_ALL=en_US.UTF-8. You can set this either in your .bash_profile, or simply by prepending it to the following command.
  3. testfile has a character encoding matching your locale (e.g., UTF-8).

Here's the full command:

cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-

Explaining part-by-part:

  • l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes a copy of each line in awk variable l and escapes every ' so the line can safely be echoed as a shell command (\047 is a single quote in octal notation).
  • cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
  • cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
  • close(cmd); ← close the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
  • sub(/ */, "", c); ← trims white space from the character count value returned by wc.
  • { print c, $0 } ← prints the line's character count value, a space, and the original line.
  • | sort -ns ← sorts the lines (by prepended character count values) numerically (-n), and maintaining stable sort order (-s).
  • | cut -d" " -f2- ← removes the prepended character count values.

It's slow (only 160 lines per second on a fast MacBook Pro) because it must execute a sub-command for each line.

Alternatively, just do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).

Quinn Comendant

Revisiting this one. This is how I approached it (count the length of LINE and store it as LEN, prepend LEN to LINE, sort numerically by LEN, then keep only the LINE):

cat test.csv | while IFS= read -r LINE; do LEN=$(printf '%s' "$LINE" | wc -c); printf '%s %s\n' "$LEN" "$LINE"; done | sort -n | cut -d ' ' -f 2-
user3479780