1

I have a data set that I cut-n-pasted from a Google Spreadsheet into my text editor (Sublime Text 2), and the data set doesnt' quite match my needs for processing.

In the form that it has coming from the spreadsheet, the data starts with one line of strings, one for each column, and then a number of rows with data; in the data rows, each column has either the value 1 or is blank. I don't know if the data is tab separated when it comes from the spreadsheet, but after pasting it in the text file it is not. If the last 1in a row is not in the last column, the line is padded with spaces up until but not including the last column.

I tried doing something with awk, but I couldn't figure out how to tackle the fact that space is both separator and column value. Next, I tried a few commands with sed, including replacing repeated spaces with zeros and piping to another sed which replaced 10 with 1 0, but then I sometimes got extra zeros inserted and I don't know where in the respective rows that happened.

This is some example data (there are 13 columns in the real file). I've added $ as the character after the last one on the line, so you can see how far the lines are padded.

"1" "2" "3" "4"                           "1" "2" "3" "4"
  1 1 $                                   0 1 1 0
1     1 $                                 1 0 0 1
  1   $                                   0 1 0 0
1 1   1 $                                 1 1 0 1

I would like to end up with something like the right (and then I don't care about where the line ends) so I can process it with awk.

And by the way, I have seen this question, which doesn't solve my problem since the solution there is based on the fact that the file is tab-delimited, with no value at all in the "empty" cells. To reiterate, my file is space-delimited, with spaces in the empty cells.

Community
  • 1
  • 1
Tomas Aschan
  • 58,548
  • 56
  • 243
  • 402

3 Answers3

1

My first attempt was not ok. So my 2nd 3rd 4th try based on the modified input with determining the number of columns automatically:

awk 'NR==1{for(;N<NF;++N)sp=" 0"sp}NR>1{$0=" "$0;sub(" +$","");gsub("  "," 0");$0=substr($0sp,2,2*N-1)}1'<<EOT
"1" "2" "3" "4"
  1 1 
1     1 
  1   
1 1   1 
EOT

First spaces are even, in between ones are odd, so I added a space at the beginning to let to use the same gsub for both cases. It is not clear how many trailing spaces present, so the script just chomp them. It contains the number of 0 number of field times. Substr start from 2 to cut the added leading space, and lasts to (number of fields)*2-1 characters to cut the trailing space.

Output:

"1" "2" "3" "4"
0 1 1 0
1 0 0 1
0 1 0 0
1 1 0 1
TrueY
  • 7,360
  • 1
  • 41
  • 46
  • This is a good solution, but I'd prefer one that was independent on the number of fields, so I can reuse it for other data files with the same problem. Adding a variable (with e.g. `-v N=7`) is OK, if necesssary, but if the script itself was "NF-agnostic" it would be even better... – Tomas Aschan Apr 22 '13 at 19:25
  • @Thomas Lycken: Ok, code modified. The number of items can be determined from the first row. – TrueY Apr 22 '13 at 19:53
0

Trying to explain why a problem is hard is good for your chances to solve it. Just because I thought about the explanation here, I also came up with a solution =)

The solution works with sed and basically in three steps:

  1. Replace all empty first columns with 0:

    cat datafile.txt | sed 's/^ /0 /g'
    
  2. Replace all empty last columns with 0:

    cat datafile.txt | sed 's/^ /0 /g' | sed 's/  $/  0/g'
    

    Here, I had to experiment a little with the number of spaces in the regex, to align all the new zeros right.

  3. Replace all empty inner columns with 0:

    cat datafile.txt | sed 's/^ /0 /g' | sed 's/  $/  0/g' | sed 's/  / 0/g'
    

    Here I also experimented with placing the 0 first or last in the replacement regex to get it right.

And of course, after this is done I redirect output to a file by tagging on > datafile-clean.txt at the end.

There's probably a more elegant way to do this, so if you have one, please do post it even though I personally don't need the solution anymore.

Update: As shown in comments, this solution can be improved quite a lot. I'll leave the original solution up here, since I think it's more clear what it does, and in what order, but this should probably be used instead.

To start with, we don't need so many pipes; instead, we use the -e flag on sed:

sed -e 's/^  /0 /' -e 's/  $/ 0/' -e 's/  / 0/g' datafile.txt

This works just as-is, given that the first row with column headers doesn't have any double-spaces. If it does, one can just read the file instead with tail -n +2 datafile and pipe to the above sed command.

Tomas Aschan
  • 58,548
  • 56
  • 243
  • 402
  • Please, please, do not use `cat file|sed ...`! You just waste a resources for nothing. Use `sed ... – TrueY Apr 20 '13 at 20:04
  • Is it really needed the `g` for pattenrs like `/^.../` and `/...$/`? They can be matched only once. – TrueY Apr 20 '13 at 20:08
  • Unfortunately there is no case which match to the `/ $/` pattern. Last character is always a single space. – TrueY Apr 21 '13 at 22:11
  • @TrueY: Actually, there are cases with more than one space before the line ends - unfortunately, because of a poor choice of example lines, this didn't show in the data I presented. I've updated to reflect this. – Tomas Aschan Apr 21 '13 at 23:07
  • It is still not clear how many trailing spaces are added to each line. In the 2nd input line there is only one trailing space, but a zero has to be added. But in the 3rd line there is also one trailing space and if finished with one. So my solution not count on the number of trailing space. – TrueY Apr 22 '13 at 07:27
0
sed 's/ /0/g;s/10/1 /g;s/00/0 /g;s/$/0 /' datafile.txt | cut -c 1-7
Eran Ben-Natan
  • 2,515
  • 2
  • 16
  • 19