Delete columns from space delimited file where file header matches

Question

I have a space delimited input text file. I would like to delete columns where the column header is size using sed or awk.

Input File:

id quantity colour shape size colour shape size colour shape size
1 10 blue square 10 red triangle 8 pink circle 3
2 12 yellow pentagon 3 orange rectangle 9 purple oval 6

Desired Output:

id quantity colour shape colour shape colour shape
1 10 blue square red triangle pink circle
2 12 yellow pentagon orange rectangle purple oval

Do you know at which positions the `size` columns are? – Jul 19 '12 at 10:39 — , Jul 19 '12 at 10:39

kev · Accepted Answer · 2012-07-19T10:35:39.603

6

`awk` command

awk '
NR==1{
    for(i=1;i<=NF;i++)
        if($i!="size")
            cols[i]
}
{
    for(i=1;i<=NF;i++)
        if(i in cols)
            printf "%s ",$i
    printf "\n"
}' input > output

pretty printing

column -t -s ' ' output

result

id  quantity  colour  shape     colour  shape      colour  shape
1   10        blue    square    red     triangle   pink    circle
2   12        yellow  pentagon  orange  rectangle  purple  oval

edited Jul 19 '12 at 10:35

answered Jul 19 '12 at 10:29

kev

155,172
47
273
272

Perfect. Just what I was looking for – Santosh Pillai Jul 19 '12 at 11:09

Birei · Answer 2 · 2012-07-18T15:33:56.787

A general solution using awk. There is a hard-coded variable (columns_to_delete) in the BEGIN block to indicate positions of fields to delete. The script then will calculate the width of each field and will delete those that match the position of the variable.

Assuming infile has the content of the question and following content of script.awk:

BEGIN {
    ## Hard-coded positions of fields to delete. Separate them with spaces.
    columns_to_delete = "5 8 11"

    ## Save positions in an array to handle it better.
    split( columns_to_delete, arr_columns )
}


## Process header.
FNR == 1 { 

    ## Split header with a space followed by any non-space character.
    split( $0, h, /([[:space:]])([^[:space:]])/, seps )

    ## Use FIELDWIDTHS to handle fixed format of data. Set that variable with
    ## length of each field, taking into account spaces.
    for ( i = 1; i <= length( h ); i++ ) { 
        len = length( h[i] seps[i] )
        FIELDWIDTHS = FIELDWIDTHS " " (i == 1 ? --len : i == length( h ) ? ++len : len)
    }   

    ## Re-calculate fields with new FIELDWIDTHS variable.
    $0 = $0
}

## Process header too, and every line with data.
{
    ## Flag to know if 'p'rint to output a field.
    p = 1 

    ## Go throught all fields, if found in the array of columns to delete, reset
    ## the 'print' flag.
    for ( i = 1; i <= NF; i++ ) { 
        for ( j = 1; j <= length( arr_columns ); j++ ) { 
            if ( i == arr_columns[j] ) { 
                p = 0 
                break
            }   
        }   

        ## Check 'print' flag and print if set.
        if ( p ) { 
            printf "%s", $i
        }
        else {
            printf " " 
        }
        p = 1 
    }   
    printf "\n"
}

Run it like:

awk -f script.awk infile

With following output:

id  quantity colour shape    colour shape      colour  shape    
1   10       blue   square   red    triangle   pink    circle   
2   12       yellow pentagon orange rectangle  purple   oval

EDIT: Oh oh, just now realised that output is not right, because of a join between two fields. Fix that would be too much work because there will be to check the max column size for every line before starting to process anything. But with this script I hope you get the idea. Not time now, perhaps I can try to fix it later on, but not sure.

EDIT 2: Fixed adding an additional space for each field deleted. It was easier than expected :-)

EDIT 3: See comments.

I've modified the BEGIN block to check that an extra variable is provided as argument.

BEGIN {
    ## Check if a variable 'delete_col' has been provided as argument.
    if ( ! delete_col ) { 
        printf "%s\n", "Usage: awk -v delete_col=\"column_name\" -f script.awk " ARGV[1]
        exit 0
    }   

}

And added to FNR == 1 pattern the process of calculating the numbers of the columns to delete:

## Process header.
FNR == 1 { 

    ## Find column position to delete given the name provided as argument.
    for ( i = 1; i <= NF; i++ ) { 
        if ( $i == delete_col ) { 
            columns_to_delete = columns_to_delete " " i
        }   
    }   

    ## Save positions in an array to handle it better.
    split( columns_to_delete, arr_columns )

    ## ...
    ## No modifications from here until the end. Same code as in the original script.
    ## ...
}

Now run it like:

awk -v delete_col="size" -f script.awk infile

And result will be the same.

Is there any way to do this without hard coding the column numbers (using column header names)? — Santosh Pillai, Jul 18 '12 at 15:01
I meant the output does not have any separators/delimiter (there no space between the columns) — Santosh Pillai, Jul 18 '12 at 20:56
@SantoshPillai: First, the example file of previous comment seems different from what you pasted in the question. It has different format, or to be clear, no format at all, only spaces. I think `awk` could do that job simpler than this script. Second, I ran a test and fields are separated by spaces in output file too, I don't know what you mean. Give us a good (and short) example of an input file and show us the case where this script fails. It will be easier for me and other users too to be able to help you. — Birei, Jul 18 '12 at 21:17
My apologies. I had added extra spaces to input data in my question to make it readable (I have removed this now). Your script removes the size columns successfully, but it also removes the spaces between other consecutive data columns. I am getting the output idquantitycolourshape colourshape colourshape 110bluesquare redtriangle pinkcircle 212yellowpentagon orangerectangle purpleoval — Santosh Pillai, Jul 18 '12 at 22:24
@SantoshPillai: But now the question is different from the original. My script won't work because it assumed that columns size was fixed. This answer has votes because it tried to solve a different problem and it's not right to edit the answer again. It needs a different approach. With this question, you could keep it here if you think it could help anyone else, or delete it. But create a new one with your new specifications. — Birei, Jul 19 '12 at 09:37

score 1 · Answer 3 · answered Jul 19 '12 at 10:25

1

Use cut:

$ cut -d' ' -f1-4,6,7,9,10 < in.txt   
id quantity colour shape colour shape colour shape
1 10 blue square red triangle pink circle
2 12 yellow pentagon orange rectangle purple oval

answered Jul 19 '12 at 10:25

I was looking for header match. Solution by @kev does the job – Santosh Pillai Jul 19 '12 at 11:20

score 0 · Answer 4 · answered Jul 18 '12 at 11:16

0

Given a fixed file format:

cut -f 1-4,6-7,9-10 infile

answered Jul 18 '12 at 11:16

Karl Nordström

327
1
8

The file format is not fixed so would like to delete where there is header match – Santosh Pillai Jul 18 '12 at 11:38

score 0 · Answer 5 · answered Jul 19 '12 at 12:03

If you have GNU cut available this can be done like so:

columns=$(head -n1 INPUT_FILE \
          | tr ' ' '\n'       \
          | cat -n            \
          | grep size         \
          | tr -s ' '         \
          | cut -f1           \
          | tr -d ' '         \
          | paste -sd ",")

cut --complement -d' ' -f$columns INPUT_FILE

Which generates a comma separated list based on the heading, and then cuts the complement of that list from INPUT_FILE.

Delete columns from space delimited file where file header matches

5 Answers5

`awk` command

pretty printing

result

Linked

Delete columns from space delimited file where file header matches

5 Answers5

awk command

pretty printing

result

Linked

`awk` command