Extract several space-delimited fields from file with varying delimiters into another file in Bash

Question

I have a Unicode/UTF-8 text file from some third-party Windows software that contains about ten columns of data.

The header line is tab-delimited. However, the remaining lines are space-delimited (not tab-delimited!) (as seen when opening the file in Notepad++ or TextWrangler).

Here are the first four lines of the file (as an example): x y z(ns) z(cm) z-abs(cm) longitude- E latitude- N type_of_object description 728243.03 5993753.83 0 0 0 143.537779835969 -36.1741232463362 linestart DRIVEWAYGRAVEL 728242.07 5993756.02 0 0 0 143.537768534943 -36.1741037476109 line DRIVEWAYGRAVEL 728242.26 5993756.11 0 0 0 143.537770619485 -36.1741028922293 linestart DRIVEWAYGRAVEL

x       y   z(ns)       z(cm)   z-abs(cm)   longitude-  E   latitude-   N   type_of_object  description
 728243.03     5993753.83    0             0             0             143.537779835969           -36.1741232463362           linestart     DRIVEWAYGRAVEL
 728242.07     5993756.02    0             0             0             143.537768534943           -36.1741037476109           line          DRIVEWAYGRAVEL
 728242.26     5993756.11    0             0             0             143.537770619485           -36.1741028922293           linestart     DRIVEWAYGRAVEL

(n.b. the space at the start of each line except for the header line)

I'm trying to write a Bash script to reformat the data for import into a different Windows program.

(I realise I could do this on the Windows command line, but I have no experience with it, so would prefer to copy the file onto my Debian machine and create a script in Bash. This means the input file and output file need to be compatible with Windows, but the script itself is obviously running in Linux.)

I need to do the following:

Extract the first two columns (x and y coordinates) but ONLY for lines containing "rectangle" in the second-last column, using a comma delimiter.
Add either a 1 or a 0 at the end of each line. The first line should have a 1, the 2-4th lines should have a 0, the 5th line should have a 1, 6-8th lines should have a 0, and so on. That is, every fourth line (starting at the first line) should have a 1, and every other line should have a 0.

So the output file should look something like this:

728257.89,5993759.24,1
728254.83,5993758.54,0
728251.82,5993762.4,0
728242.45,5993765.07,0

I have tried the answer to this question. e.g.

awk '
NR==1{
    for(i=1;i<=NF;i++)
        if($i!="z(ns)")
            cols[i]
}
{
    for(i=1;i<=NF;i++)
        if(i in cols)
            printf "%s ",$i
    printf "\n"
}' input.file > output.file

...to remove the third column (and then variations on this to get rid of the other unwanted columns). However, all I'm left with is just an empty output file.

I also tried hacking together a solution with grep and awk:

touch output.txt
count=0
IFS=$'\n'
set -f #disable globbing
for i in $( grep "rectangle" $inputFile )
do
    Xcoord=$(awk 'BEGIN { FS=" " } { print $1 }' $i )
    printf "$Xcoord" >> output.txt
    echo ","
    Ycoord=$(awk 'BEGIN { FS=" " } { print $2 }' $i )
    printf "$Ycoord" >> output.txt
    printf ","
    count=$((count+1))
    if [[ count = "1" ]]
    then
        printf "$count\n" >> output.txt
    else
        printf "0\n" >> output.txt
    fi
done
set +f #re-enable globbing for future use of the terminal.

...the idea behind this was: -For each line in $inputFile that contains "rectangle"

1. Append the first column (variable "Xcoord") to output.txt
2. Append a comma to output.txt
3. Append the second column (variable "Ycoord") to output.txt
4. Append another comma to output.txt
5. Append the 1 or 0 as per the if test based on the value of the variable "count", along with a new line.

This idea fails. Instead of saving the data to the file, it prints all columns of the file to stdout, with the first column replaced with the text "(No such file or directory)":

...and output.txt is just full of zeros:

How can I fix this?
Do I need to do anything to make the resulting output.txt file Windows-format?

Thanks in advance...

eraenderer · Answer 1 · 2018-08-16T14:28:27.307

1

I think awk is capable of all you need in one line:

 awk -F '[[:space:]][[:space:]]+' 'BEGIN{OFS = ","} {if ($8 == "rectangle") print $1, $2 }' a.txt | awk 'BEGIN{OFS = ","}{if((NR+3)%4) print $0,0;else print $0,1}'

You set set delimiter between entries to "at least two spaces" by

-F '[[:space:]][[:space:]]+

set your output separator to comma by

'BEGIN{OFS = ","}

check for your rectangle condition in the second last column

if ($8 == "rectangle")

and print the columns you want as output

print $1, $2

To add your 0,1 pattern in the third output column you have to restart awk to get the line numbering of your result file, not the original input lines. The awk NR variable contains the line number starting at 1.

(NR+3)%4

(% is modulo-operation)results to 0 (=false) for line number 1,5,9,... so you just have to print the complete line (variable $0 ) followed by a 0 in the if-case and a 1 in the else case.

print $0,0;else print $0,1

Hope that's what you wanted.

edited Aug 16 '18 at 14:28

answered Aug 16 '18 at 14:23

eraenderer

107
5

Hi @eraenderer. Thanks for this answer and detailed explanation. All makes sense. However, I can't figure out how to specify the input file in your code... could you please clarify how to tell your awk command which file to use as input? Thanks. (I'm assuming a.txt is the output file...?) – ZPMMaker Aug 16 '18 at 22:59
I did some experimenting to try and figure it out based on a bunch of awk guides I found elsewhere... `awk -F '[[:space:]][[:space:]]+' 'BEGIN{OFS = ","} {if ($8 == "rectangle") print $1, $2 }' $inputFile | awk 'BEGIN{OFS = ","}{if((NR+3)%4) print $0,0;else print $0,1 >> "output2.txt" }'` ...that just results in an empty output file. Then I tried: `awk -F '[[:space:]][[:space:]]+' 'BEGIN{OFS = ","} { if ($8 = "rectangle") { print $1, $2 >> "output.txt" } }' $inputFile` ...this just grabs the input file, and outputs each line with a comma on the end into output.txt. The If $8 fails – ZPMMaker Aug 17 '18 at 00:39
Sorry, comment length was too long. I meant to say at the end "The if ($8 == "rectangle") doesn't seem to be working. – ZPMMaker Aug 17 '18 at 00:40
OK, some more experimenting... `awk -F '[[:space:]][[:space:]]+' 'BEGIN{OFS = ","} { if ($8 = "rectangle") { print $1 >> "output.txt" } }' $inputFile` does the same thing (prints the whole input file to output file, with a comma at the end of each line. Column 8 filtering is not being applied). `awk -F '[[:space:]][[:space:]]+' 'BEGIN{OFS = ","} { if ($8 = "rectangle") { print $2 >> "output.txt" } }' $inputFile` does the exact same thing (note the change in print $1 to print $2... it should change columns, but it isn't. Perhaps the field separator isn't correctly defined? – ZPMMaker Aug 17 '18 at 00:45
Hi @ZPMMaker you were right about the input. It is a.txt in this example. – eraenderer Aug 17 '18 at 07:49
I did not notice you wanted to get windows line endings. You could add them by using `BEGIN{OFS = ",";ORS="\r\n"}` at the start of the second awk call. – eraenderer Aug 17 '18 at 09:17
Apart from that my line works for me on your example data you posted in your question if i add some lines containing "rectangle". Could you copy your example data form above and check it on this data? I see two possible sources of error: 1. different awk implementation. (check `awk --version` I am on GNU awk 4.1.3) 2. Your data is not really containing that many spaces. Have you checked the conversion of [tabs to space in Notepad++?](https://stackoverflow.com/questions/455037/convert-tabs-to-spaces-in-notepad) – eraenderer Aug 17 '18 at 09:29

score 0 · Accepted Answer · edited Aug 17 '18 at 05:53

I figured out a solution.

Remove the header line.
Filter all lines based on the word "rectangle" using grep.
Replace whitespaces with commas to make it easier to deal with.
Iterate through each line, saving to file as needed.

#!/bin/bash
#Code here to retrieve the file from command arguments and set it as $inputFile (removed for brevity)
sed -i 1d $inputFile #Remove header line

sed 's/^ *//g' < $inputFile > work.txt #Remove first character in each line (a space).
tr -s ' ' <work.txt | tr ' ' ',' >work2.txt #Switch spaces for commas.
grep "rectangle" work2.txt > work3.txt #Print all lines containing "rectangle" in them to new file.
rm lineout.txt #Delete output file in case script was run previously.
touch lineout.txt
count=0
while IFS='' read -r line || [[ -n "$line" ]]; do
    printf "$line" > line.txt
    awk 'BEGIN { FS="," } { printf $1  >> "lineout.txt" }' line.txt
    printf "," >> lineout.txt
    awk 'BEGIN { FS="," } { printf $2  >> "lineout.txt" }' line.txt
    printf "," >> lineout.txt
    count=$((count + 1))
    if [[ $count = "1" ]]
    then
        printf "$count\n" >> lineout.txt
    else
        printf "0\n" >> lineout.txt
        if [[ $count = "4" ]]
        then
            count=0
        fi
    fi
done < work3.txt

You are successfully using pipes elsewhere, why do you need so many temporary files? `grep moo | tr x y >tmp; grep bar tmp` might as well be refactored to `grep moo | tr x y | grep bar`. Similarly, if you put the redirect last `done >lineout.txt` you don't need to zap the output file before the loop and your code will run more efficiently because the shell doesn't need to close the file, open for appending, and seek to the end on every iteration. — tripleee, Aug 17 '18 at 05:55
@tripleee, thanks, but I don't understand your comment... "grep moo | tr x y >tmp; grep bar tmp" doesn't appear in my code at all so I don't get what you're saying. Could you please provide more detail? Thanks. :) — ZPMMaker, Aug 19 '18 at 06:11
It's a generalization. But look at it from the other side then: `tr -s ' ' work2.txt` is equivalent to `tr -s ' ' tmp4; tr ' ' ',' work2.txt; rm tmp4` where you create *yet* another useless temporary file and lose the efficiency of parallel processing that a pipeline provides. — tripleee, Aug 19 '18 at 06:32

score 0 · Answer 3 · answered Aug 17 '18 at 08:16

This can be formatted easily using sublime text editor which is capable of:

Multiple Selection
Vertical Selection
Search and Replace text similar to bash expression

I'm not trying to advertise sublime but this tool definitely solved most of my text editing problems.

Extract several space-delimited fields from file with varying delimiters into another file in Bash

3 Answers3