How do I delete all lines in a concatenated text file that match the header WITHOUT deleting the header? [bash]

Question

My apologies if this question already exists out there. I have a concatenated text file that looks like this:

#Chr    start   end ID  GTEX-Q2AG   GTEX-NPJ8
1   1   764484  783034  1:764484:783034:clu_2500_NA 0.66666024153854    -0.194766358934969
2   1   764484  787307  1:764484:787307:clu_2500_NA -0.602342191830433  0.24773430748199
3   1   880180  880422  1:880180:880422:clu_2501_NA -0.211378452591182  2.02508282380949
4   1   880180  880437  1:880180:880437:clu_2501_NA 0.231916912049866   -2.20305649485074
5   1   889462  891303  1:889462:891303:clu_2502_NA -2.3215482460681    0.849095194607155
6   1   889903  891303  1:889903:891303:clu_2502_NA 2.13353943689806    -0.920181808417383
7   1   899547  899729  1:899547:899729:clu_2503_NA 0.990822909478346   0.758143648905368
8   1   899560  899729  1:899560:899729:clu_2503_NA -0.938514081703866  -0.543217522714283
9   1   986217  986412  1:986217:986412:clu_2504_NA -0.851041440248378  0.682551011244202

The first line, #Chr start end ID GTEX-Q2AG GTEX-NPJ8, is the header, and because I concatenated several similar files, it occurs multiple times throughout the file. I would like to delete every later instance of the header in the text without deleting the first one.

BONUS: I actually need help with this too and would like to avoid posting another Stack Overflow question. The first column of my data was generated by R and represents row numbers. I want them all gone without deleting #Chr. The extra column leaves the data rows with one more column than the header, and that is a problem.

This problem is different than ones recommended to me because of the above additional issue and also because you don't necessarily have to use regex to solve this problem.

If you have different questions, post a new question. That's what this site is about. — glenn jackman, Jan 22 '19 at 17:48
@glennjackman I'm worried about being punished for asking a bad question but I searched for this and couldn't find it. — CelineDion, Jan 22 '19 at 18:05
Note that both of these issues can be prevented in the concatenation process. You might want to reconsider how you concatenate the files. — glenn jackman, Jan 22 '19 at 19:40
It needs to be said again: If you have two questions, post two separate questions. Linking between them is fine. You are *more* likely to get "punished" for violating the "one question per question" guidance than for posting multiple questions, even if they turn out to be duplicates of existing questions. If you can show us what you already searched for and how none of the answers you found worked for you, that's an excellent question. — tripleee, Jan 23 '19 at 03:38

score 1 · Accepted Answer · answered Jan 22 '19 at 17:54

The following AWK script removes all lines that are exactly the same as the first one.

awk '{ if($0 != header) { print; } if(header == "") { header=$0; } }' inputfile > outputfile

It prints the first line because the initial value of header is an empty string, so the first line differs from it. It then stores the first line in header, because header is still empty at that point.

After this it will print only lines that are not equal to the first one already stored in header. The second if will always be false once the header has been saved.

Note: If the file starts with empty lines these empty lines will be removed.
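
To see this behavior on a small case, here is a minimal sketch. The file names sample.txt and deduped.txt are made up for illustration, and the sample rows are shortened versions of the data in the question.

```shell
# Build a small test file in which the header line occurs three times.
# (sample.txt and deduped.txt are hypothetical names.)
printf '%s\n' \
  '#Chr start end ID' \
  '1 1 764484 783034' \
  '#Chr start end ID' \
  '2 1 764484 787307' \
  '#Chr start end ID' > sample.txt

# Keep the first line and drop every later line identical to it.
awk '{ if($0 != header) { print; } if(header == "") { header=$0; } }' sample.txt > deduped.txt

cat deduped.txt
```

Only the first header and the two data rows should remain in deduped.txt.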

To remove the first number column you can use

sed 's/^[0-9][0-9]*[ \t]*//' inputfile > outputfile
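
One portability caveat: reading \t as a tab inside a bracket expression is, as far as I know, a GNU sed extension, and other sed implementations may treat [ \t] as "space, backslash, or the letter t". A minimal sketch of a variant that uses the POSIX class [[:blank:]] (space or tab) instead, with hypothetical file names:

```shell
# rows.txt: a header plus two tab-separated data rows that start with a row number.
# (rows.txt and rows_stripped.txt are hypothetical names.)
printf '#Chr\tstart\tend\n1\t1\t764484\n2\t1\t787307\n' > rows.txt

# Strip a leading run of digits and the spaces or tabs that follow it.
# The header starts with '#', so it is left untouched.
sed 's/^[0-9][0-9]*[[:blank:]]*//' rows.txt > rows_stripped.txt
```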

You can combine both commands in a pipeline:

awk '{ if($0 != header) { print; } if(header == "") { header=$0; } }' inputfile | sed 's/^[0-9][0-9]*[ \t]*//' > outputfile
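
The same two steps can also be done in a single awk pass, which avoids the pipe. This is a sketch rather than a drop-in replacement: it assumes the first line of the file is the header and that each row number is a run of digits followed by spaces or tabs. The file names are hypothetical.

```shell
# combined.txt: tab-separated sample with a repeated header (hypothetical name).
printf '#Chr\tstart\tend\n1\t1\t764484\n#Chr\tstart\tend\n2\t1\t787307\n' > combined.txt

awk '
  NR == 1      { header = $0; print; next }   # remember and print the first header
  $0 == header { next }                       # skip repeated headers
  { sub(/^[0-9]+[ \t]*/, ""); print }         # drop the leading row number
' combined.txt > cleaned.txt
```

Because sub() edits the line in place instead of reassigning a field, the original separators between the remaining columns are kept.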

Thank you for providing such a thorough answer — CelineDion, Jan 22 '19 at 18:04

score 1 · Answer 2 · answered Jan 22 '19 at 18:15

Maybe this is helpful. The pipeline below does three things:

delete all header lines
delete the first column
add the header back as the first line

cat foo.txt
#Chr    start   end ID  GTEX-Q2AG   GTEX-NPJ8
1   1   764484  783034  1:764484:783034:clu
#Chr    start   end ID  GTEX-Q2AG   GTEX-NPJ8
2   1   764484  783034  1:764484:783034:clu
#Chr    start   end ID  GTEX-Q2AG   GTEX-NPJ8
3   1   764484  783034  1:764484:783034:clu

sed '/#Chr    start   end ID  GTEX-Q2AG   GTEX-NPJ8/d' foo.txt | awk '{$1 = ""; print $0 }' | sed '1i #Chr    start   end ID  GTEX-Q2AG   GTEX-NPJ8'

#Chr    start   end ID  GTEX-Q2AG   GTEX-NPJ8
 1 764484 783034 1:764484:783034:clu
 1 764484 783034 1:764484:783034:clu
 1 764484 783034 1:764484:783034:clu
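
Note that awk '{$1 = ""; print $0 }' rebuilds each line with single spaces, so the output above starts with a space and no longer has the original column separators. If the file is tab-separated (an assumption; the listing above does not show which whitespace is used), a sketch of the same three steps with cut keeps the tabs intact. foo.tsv and foo_clean.tsv are hypothetical file names.

```shell
# foo.tsv: tab-separated sample with a repeated header (hypothetical name).
printf '#Chr\tstart\tend\n1\t1\t764484\n#Chr\tstart\tend\n2\t1\t764484\n' > foo.tsv

{
  head -n 1 foo.tsv                      # add the first header
  grep -v '^#Chr' foo.tsv | cut -f 2-    # delete all headers, then the first column
} > foo_clean.tsv
```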

score 0 · Answer 3 · glenn jackman · answered Jan 22 '19 at 17:54 · edited Jan 22 '19 at 19:38

I would do

awk 'NR == 1 {header = $0; print} $0 != header' file

You may want to include the first occurrence of the header with a print in `NR == 1 { header = $0; print }` – etuardu Jan 22 '19 at 17:58
Quite right, thanks. – glenn jackman Jan 22 '19 at 19:38

score 0 · Answer 4 · answered Jan 22 '19 at 17:59


Using sed

sed '2,${/HEADER/d}' input.txt > output.txt

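Command explained:

From line 2 to the last line: 2,$
Select any line in that range matching the pattern HEADER: /HEADER/
Delete it: d

HEADER is a placeholder for a pattern that matches your header line, for example ^#Chr.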
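
If you would rather not write a pattern by hand, one alternative (a different technique from the sed command above, sketched here on made-up sample data) is to read the header from the first line and let grep -F -x drop later lines that equal it exactly:

```shell
# Sample input with a repeated header; the contents are illustrative only.
printf '%s\n' '#Chr start end' '1 1 764484' '#Chr start end' '2 1 787307' > input.txt

header=$(head -n 1 input.txt)

{
  printf '%s\n' "$header"                               # keep the first header
  tail -n +2 input.txt | grep -v -x -F -e "$header"     # drop exact repeats
} > output.txt
```

-F treats the header as a fixed string rather than a regular expression, and -x requires the whole line to match, so data rows that merely contain part of the header are kept.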


brunorey

