3

I am coding using bash on terminal through a docker container on my mac. I am struggling to figure out how to remove the last 2 columns on my TSV file. It has 7 total and the last 2 are not needed for my work and are required to be removed.

Edit: The first picture is the original data file, the second is what the code is doing and it is deleting some random entries from the column. The third picture is what the end result of this program should do. The month and year columns I am struggling with also but I deleted the code and tried to simplify the data first.

I tried using awk and using NF = NF - 2 which does remove the last 2 columns but for some reason deletes some of the data I have in my 5th column which I need. So whilst I got the column deletion I needed, the code did a little extra. Here is the code:

preprocess() {
 31     input_file="$1"
 32 
 33     # Extract the base name of the input file
 34     base_name=$(basename "$input_file" .tsv)
 35 
 36     # Create the new output file name
 37     output_file="${base_name}_clean.tsv"
 38 
 39     awk -F'\t' 'BEGIN{OFS=FS} 
 40     {
 41         NF = NF - 2
 42 
 43         print
 44     }' "$input_file" > "$output_file"
 45 }

I Have a few other lines but they shouldn't cause any issues. They just check the file exists etc.

The original data file

The incorrect output

What it should look like. Will be doing the month and year column later and will need help with that also. Tried doing that first but couldn't get it to work so I wiped the code and thought I'd simplify the data first

Ulrich Eckhardt
  • 16,572
  • 3
  • 28
  • 55
Spoku
  • 49
  • 4
  • 2
    Please take a look at [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). – Cyrus May 20 '23 at 16:07
  • 1
    `csvcut -t -c 1-5 input.tsv` (csvcut is part of the csvkit package) – Shawn May 20 '23 at 16:15
  • 3
    please update the questino with the sample data file, the (wrong) output generated by your code and the (correct) expected output – markp-fuso May 20 '23 at 17:02
  • There is [some discussion here](https://unix.stackexchange.com/questions/450603/what-can-be-done-with-nf-in-an-awk-code-block-or-function) on modification directly of `NF`. Maybe it's relevant for you ? – MyICQ May 20 '23 at 17:22
  • 1
    Regarding "I tried using awk and using NF = NF - 2 which does remove the last 2 columns but for some reason deletes some of the data I have in my 5th column which I need." - `NF=NF-2` is undefined behavior so your awk could do anything, but chances are you have DOS line endings messing with what you see, see [why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it](https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it) – Ed Morton May 21 '23 at 14:59
  • If you [edit] your question to provide concise, testable sample input and expected output then we can best help you. – Ed Morton May 21 '23 at 15:00

5 Answers5

2

With AWK a script like this works:

Inputdata (separated by ; for clarity here, but could be tab also).

F1;F2;F3;F4
V11;V12;V13;V14
V21;V22;V23;V24

Program to convert. Comments to everyone can follow, even those new to awk.

BEGIN{
   FS=";"    # Convert til "\t" for TAB separation
   OFS=";"   # set as desired
   skipcolcount=2
}

{ 
  # In each line, loop over the fields
  for (i=1;i<=NF-skipcolcount;i++) {   
     printf $i                   # reference field by index variable  
     if (i<NF-skipcolcount) {    # no separator after last field
       printf OFS
     }
  }
  printf "\n"                    # linefeed after each line
}

Result:

F1;F2
V11;V12
V21;V22
MyICQ
  • 987
  • 1
  • 9
  • 25
  • This is more verbose than other solution modifiing `NF`, but allows more flexibility such as printing every other column, or skipping certain specific columns. – MyICQ May 21 '23 at 06:31
2

I tried using awk and using NF = NF - 2 which does remove the last 2 columns but for some reason deletes some of the data I have in my 5th column which I need.

This is unexpected for me, I did run your code using GNU Awk 5.1.0 and it works fine, however you are using

docker

so maybe this force usage of erratic version of awk? Anyway, if your task is given as

how to remove the last 2 columns on my TSV file. It has 7 total and the last 2 are not needed for my work and are required to be removed.

this might be simplified to: get first 5 columns of tab-separated file, which can be expressed in awk as

awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$4,$5}' file.tsv

Please run it and write if output is as desired.

Daweo
  • 31,313
  • 3
  • 12
  • 25
  • Thank you so much @Daweo, that fixed the problem. Do you have any advice on how I can iterate through the data next to append the month and year column onto the end of the data file? The format and idea is quite a bit easier than to delete the columns I hope. – Spoku May 21 '23 at 11:25
  • 1
    @Spoku this is different task, please post separate question for that – Daweo May 21 '23 at 17:50
  • Short, but that line of code does not "remove the last two columns". Instead, it assumes you know you have 7 columns, remove two, and now print 1..5. Got the job done though. – MyICQ May 24 '23 at 05:12
1

The easiest will be rev/cut/rev combination

$ rev inputfile | cut -f3- | rev > output.file
karakfa
  • 66,216
  • 7
  • 41
  • 56
  • For some reason this still removes some of my data out of the 5th column. It does remove the last 2 columns still. – Spoku May 21 '23 at 07:51
  • that's not possible unless you have malformed input data. Please try `cat -A inputfile | head` and post it to your question as text. Also if you sourced your file from DOS/Windows domain try `dos2unix inputfile` to fix line endings. – karakfa May 21 '23 at 11:55
1

You are close.

Given the following TSV file:

cat file
1   2   3   4   5
6   7   8   9   10
11  12  13  14  15

You can do this in awk:

awk 'BEGIN{FS=OFS="\t"}
{NF=3} 1
' file 

Prints:

1   2   3
6   7   8
11  12  13

Or Ruby:

ruby -ne 'puts $_.split("\t")[0..2].join("\t")' file
# same

Or Perl:

perl -nE 'say join("\t", (split "\t")[0..2])' file
# same
dawg
  • 98,345
  • 23
  • 131
  • 206
  • I've tried to use the awk one but It doesn't quite seem to work. Alot of the suggested adjustments below seem to continue to delete the same columns. Instead of deleting them and saving as a new file. Is it possible to copy the columns I need to the new file instead? – Spoku May 21 '23 at 08:00
  • 1
    Trying to reduce the value of `NF` is undefined behavior. Some awks will trim fields from the end, other will ignore it, others could do anything else (core dump? trim fields from the front? etc...) – Ed Morton May 21 '23 at 14:57
1

One way could be to use :

sed 's/\t[^\t]*\t[^\t]*$//' "$input_file" > "$output_file"

Match explained:

  • \t - a tab character
  • [^\t]* - zero or more non-tab characters
  • \t - a tab character
  • [^\t]* - zero or more non-tab characters
  • $ - end of line anchor

Substitute (the // part) with an empty string.

An alternative:

sed -E 's/(\t[^\t]*){2}$//' "$input_file" > "$output_file"

Here the match \t[^\t]* is in a group (...) which is repeated twice {2}.

Ted Lyngmo
  • 93,841
  • 5
  • 60
  • 108