1

I need to work across the columns of a large .tsv and replace the column name if it matches any of a number of strings, labelling it an error if no match is found. Below is a simplified version of what I have, and it works here.

Sample tab-separated input test.tsv:

Col1    Col2    Col3    Col4
A   B   C   Foo
D   E   F   Bar
G   H   I   Baz

Script:

#!/bin/bash

set -eu
shopt -s failglob

awk 'BEGIN {FS=OFS="\t"} \
     {if (NR==1) \
        {for (i = 1 ; i <= NF ; i++) \
        if ($i == "Col1") { $i = "NewCol1" } \
        else if ( $i == "Col2") { $i = "NewCol2" } \
        else if ( $i == "Col4") { $i = "NewCol4" } \
        else { $i = "Error: "$i } \
        } print \
        }' test.tsv

Tab-separated output:

NewCol1 NewCol2 Error: Col3 NewCol4
A   B   C   Foo
D   E   F   Bar
G   H   I   Baz

However, in my real process Col4 is not being successfully processed. Instead, it is being flagged as an error. The issue does not occur if I use LibreOffice Calc to open the file and save it again, still as .tsv. This makes me think it may be a line ending format issue, but I have used vim to check the endings in the input file, and they are consistently \n. What am I missing here?

tesolat
  • 81
  • 7
  • 2
    I wouldn't give up on your line-ending theory quite yet. Try checking the file (or the first few lines, at least) with `cat -t`. – jas Nov 08 '19 at 19:52
  • With the sample and code provided, it works just fine. The error you get is in `Col3` because you have to add `else if ( $i == "Col3") { $i = "NewCol3" } \`. between the Col2 and Col4 management. – ingroxd Nov 08 '19 at 19:54
  • @ingroxd - yeah, the error in there is expected - the code works fine, as I said in the post, but there is another issue. @jas `cat -t` is showing `^M` as the line ending on my real input file. What's the significance of this? The real input file has not been on a Windows system though. – tesolat Nov 08 '19 at 20:08
  • 2
    @tesolat: If you are using `gnu-awk` then use `awk -v RS='\r?\n' '...' file` – anubhava Nov 08 '19 at 20:16
  • Okay, got there in the end. @anubhava fixed it directly (though was the `?` in the `RS` in error?) and @jas pointed me to `^M`, which can be processed out using `sed`. Etiquette-wise, how to give thanks here? Answer it myself and give kudos to you both? – tesolat Nov 08 '19 at 20:37
  • @tesolat the `?` is present so gawk will split the file on `\n`s whether preceded by `\r` or not. The downside to that is that then CSVs exported from Excel which have `\n` within fields will get mangled. See https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it and https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk. – Ed Morton Nov 08 '19 at 23:34
  • 1
    Both links are very useful @EdMorton. Neither showed in the searches I made prior to asking the question though :/ Hopefully setting the problem in this particular context (awk, column processing) may be helpful to someone else! – tesolat Nov 09 '19 at 17:52

1 Answers1

0

Evidently your input has line ending with ^M or \r. You may use this awk with a custom regex in RS (this requires GNU awk):

awk -v RS='\r?\n' '...' file

-v RS='\r?\n' sets RS or record separator as optional \r followed by \n, thus allowing it to match lines ending with \n or \r\n.

RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
anubhava
  • 761,203
  • 64
  • 569
  • 643