I have a file with such lines:

street "City Name" 5 7500    30.3.2016
"Street Name"    city 4 1000   15.01.2015
<street name> <city name> <num of room> <price> <date>

I need to go over the file and sort it by some of the columns, such as name, price, or date.

I'm stuck on the whitespace: there can be multiple spaces between fields, the names can be one word or several, and there can be leading whitespace. I also cannot use sed.

Can anyone offer a way to collapse the repeated whitespace so that I'm left with lines such as:

street "City Name" 5 7500 3.30.2016
"Street Name" city 4 1000 01.15.2015
Jonathan Leffler
  • It's the *quotes*, not the whitespace, that make this an interesting problem. – Charles Duffy May 24 '16 at 23:35
  • ...and the output format you *say* you want, with quotes but still using spaces, won't be any easier to sort with standard UNIX tools than your current format is. A tab or other distinct delimiter is the right tool for the job. – Charles Duffy May 24 '16 at 23:40
  • @CharlesDuffy: No-one's noticed that the dates are converted from d-m-y to m-d-y format yet. At least, the second was before I edited things; the first was transformed more subtly (from `30.3.2016` to `3.28.2016`). I made the numbers consistent, but the OP should clarify whether or not this transform is part of the question. – Jonathan Leffler May 25 '16 at 00:03

3 Answers

The following will transform your file into tab-delimited form, where sort or other standard tools can handle it trivially:

while read -r line; do
  printf '%s\n' "$line" | xargs printf '%s\t'
  echo
done

This works because xargs parses quotes and whitespace, breaking each line into its individual elements, and then passes each element to printf '%s\t', which prints those elements with tabs between them; the echo then adds newlines between the lines of output.
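
To see what the loop produces, here is a minimal sketch run on the first line from the question. The name to_tsv is just a label I've made up for the loop above, and tr is only there to make the tabs visible as |:

```shell
# to_tsv is a hypothetical wrapper name for the loop shown above
to_tsv() {
  while read -r line; do
    # xargs honours the double quotes, so "City Name" stays one argument
    printf '%s\n' "$line" | xargs printf '%s\t'
    echo
  done
}

sample='street "City Name" 5 7500    30.3.2016'
# Replace tabs with | purely so the field boundaries are visible
tsv_line=$(printf '%s\n' "$sample" | to_tsv | tr '\t' '|')
printf '%s\n' "$tsv_line"
```

Every field, including the last one, is followed by a tab, so each output line ends with a trailing tab.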

The output can then be fed into something like the following:

sort -t $'\t' -k2,2 -k1,1

...which will sort the tab-delimited columns, first on the second key (city, in your example), then on the first (street name, in your example).
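
The question also mentions sorting by price and date. A rough sketch of both follows, assuming the data is already tab-delimited, the price is the fourth column, and the date is the fifth in day.month.year form; the third record is made up for illustration. Price needs a numeric key (n), and the date is handled by temporarily prepending a yyyymmdd key with awk:

```shell
tab=$(printf '\t')

# Tab-delimited records: street, city, rooms, price, date (d.m.yyyy).
# The third record is invented for this example.
tsv=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  'street' 'City Name' 5 7500 30.3.2016 \
  'Street Name' 'city' 4 1000 15.01.2015 \
  'Other Street' 'Town' 3 12000 1.2.2016)

# Numeric sort on the fourth column; without n, 12000 would sort before 7500
by_price=$(printf '%s\n' "$tsv" | sort -t "$tab" -k4,4n)

# Prepend a sortable yyyymmdd key, sort on it, then cut the key off again
by_date=$(printf '%s\n' "$tsv" |
  awk -F '\t' -v OFS='\t' '{ split($5, d, "."); print sprintf("%04d%02d%02d", d[3], d[2], d[1]), $0 }' |
  sort -t "$tab" -k1,1n |
  cut -f2-)

printf '%s\n' "$by_price"
printf '%s\n' "$by_date"
```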


Let's take the below input file, which will make behavior clearer than was the case with the original proposal:

"Street A" "City A" 1
"Street B" "City B" 2
"A Street" "City A" 3
"B Street" "City B" 4
"Street A" "A City" 5
"Street B" "B City" 6
Street City 7

Convert that to tab-delimited form with the loop above, then pipe it through LANG=C sort -s -t$'\t' -k2,2 -k1,1 | expand -t16, which sorts first by city, then by street, and prints with 16-column tab stops. The output is as follows:

Street A        A City          5
Street B        B City          6
Street          City            7
A Street        City A          3
Street A        City A          1
B Street        City B          4
Street B        City B          2

By contrast, use LANG=C sort -s -t$'\t' -k1,1 -k2,2 | expand -t16 to sort first by street and then by city (and print with 16-space tabs), and you get the following:

A Street        City A          3
B Street        City B          4
Street          City            7
Street A        A City          5
Street A        City A          1
Street B        B City          6
Street B        City B          2

If you want to go back from the tab-delimited format to the quoted format, this is feasible too:

#!/bin/bash
#      ^^^^- Important, not /bin/sh

while IFS=$'\t' read -r -a cols; do
  for col in "${cols[@]}"; do
    if [[ $col = *[[:space:]]* ]]; then
      printf '"%s" ' "$col"
    else
      printf '%s ' "$col"
    fi
  done
  printf '\n'
done

Taking the sample input above and running it through the first script (to convert to tab-delimited form), then sort -t$'\t' -k1,1 -k2,2 (to sort in that form), then this second script (to convert back to whitespace separators with quotes), yields the following:

"A Street" "City A" 3
"B Street" "City B" 4
Street City 7
"Street A" "A City" 5
"Street A" "City A" 1
"Street B" "B City" 6
"Street B" "City B" 2
Charles Duffy

You could use tr with the -s flag (which means squeeze)

echo "  a sentence       with lots of    spaces" | tr -s " "

and if you want to remove the initial space just pipe it through cut

echo "  a sentence       with lots of    spaces" | tr -s " " | cut -d ' ' -f2- 

EDIT: As Charles Duffy suggested, you could use sed instead; unlike cut, it won't drop the first word when there is no leading space:

echo "  a sentence       with lots of    spaces" | tr -s " " | sed -re 's/^ +//'
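
If sed is off the table, as in the question, the leading space can also be removed by the shell itself. A small sketch using POSIX parameter expansion (the variable names are arbitrary):

```shell
squeezed=$(echo "  a sentence       with lots of    spaces" | tr -s " ")
# After tr -s there is at most one leading space; ${var# } strips it if present
trimmed=${squeezed# }
printf '%s\n' "$trimmed"
```

Like the pipelines above, this does not look at quotes, so runs of spaces inside a quoted field are squeezed as well.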
Alberto Zaccagni

Give this a try:

awk -F \" -v OFS=\" '{for (i=1; i<=NF; i=i+2) gsub(/  */," ",$i) ; print}' afile

The goal is to leave any string enclosed in double quotes unchanged, and to replace multiple spaces outside the quotes with a single one.

-v OFS=\" sets " as the output field separator, which print uses when it rebuilds the line.

-F \" sets " as the field separator for the input lines. Each line is split at every " into several elements, which are stored in the variables $1, $2, etc.

As a consequence, the odd-numbered fields ($1, $3, etc.) are the ones outside the quotes.

NF is the number of elements found in the current line after the split.

The for loop visits the odd-numbered fields only, and in each one every run of spaces is replaced with a single space.
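
To make the odd/even split concrete, this sketch prints each field awk sees for a made-up line, then applies the substitution to the odd fields only (the sample line and variable names are my own):

```shell
line='street   "City   Name"   5    7500'

# Show how the line is split on double quotes, one numbered field per row
fields=$(printf '%s\n' "$line" |
  awk -F '"' '{ for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }')
printf '%s\n' "$fields"

# Squeeze runs of spaces in odd fields only; field 2 (inside the quotes) is untouched
result=$(printf '%s\n' "$line" |
  awk -F '"' -v OFS='"' '{ for (i = 1; i <= NF; i += 2) gsub(/ +/, " ", $i); print }')
printf '%s\n' "$result"
```

Field 2 is the quoted city name, so its inner spaces survive while the spaces in fields 1 and 3 are collapsed.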

The test:

$ awk -F \" -v OFS=\" '{for (i=1; i<=NF; i=i+2) gsub(/  */," ",$i) ; print}' afile
street "City Name" 5 7500 30.3.2016
"Street Name" city 4 1000 15.01.2015
<street name> <city name> <num of room> <price> <date>
Jay jargot