
I have the following input file:

a 1  o p
b  2 o p p
c     3 o p p  p

In the last line there is a double space between the last two p's, and the columns are separated by varying amounts of whitespace.

I have used the solution from: Using awk to print all columns from the nth to the last.

awk '{for(i=2;i<=NF;i++){printf "%s ", $i}; printf "\n"}'

and it works fine until it reaches the double space in the last column, where it removes one space.

How can I avoid that while still using awk?
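
For reference, a minimal reproduction sketch. `print_from_second` is a hypothetical helper name wrapping the command above, and the sample lines are piped in with printf rather than read from a file. Because awk splits each record on runs of whitespace, the original spacing is already gone by the time the loop prints each `$i`; the sed step only appends a `|` so the end of each line is visible.

```shell
# print_from_second: the loop from the question, wrapped in a function
# (hypothetical name, for illustration only).
print_from_second() {
    awk '{for(i=2;i<=NF;i++){printf "%s ", $i}; printf "\n"}'
}

# Fields are re-joined with exactly one space each, so the double space
# before the last p in the third line is lost.
printf 'a 1  o p\nb  2 o p p\nc     3 o p p  p\n' | print_from_second | sed 's/$/|/'
```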

meso_2600

4 Answers

4

Since you want to preserve spaces, let's just use cut:

$ cut -d' ' -f2- file
1 o p
2 o p p
3 o p p  p

Or, for example, to start from column 4:

$ cut -d' ' -f4- file
p
p p
p p  p

This will work as long as the columns you are removing are one-space separated.
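
To illustrate that limitation, here is a small sketch using a line from the question in which two spaces follow the first column. With `-d' '`, cut treats every single space as a delimiter, so the second "field" is empty and the output keeps a leading space.

```shell
# Each space is its own delimiter: "b  2" splits into "b", "" and "2".
printf 'b  2 o p p\n' | cut -d' ' -f2-
# prints " 2 o p p" (note the leading space)
```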


If the columns you are removing are also separated by varying amounts of whitespace, you can use the beautiful solution by Ed Morton in Print all but the first three columns:

awk '{sub(/[[:space:]]*([^[:space:]]+[[:space:]]+){1}/,"")}1'
                                                   ^
                                        number of cols to remove

Test

$ cat a
a 1 o p
b    2 o p p
c  3 o p p  p
$ awk '{sub(/[[:space:]]*([^[:space:]]+[[:space:]]+){2}/,"")}1' a
o p
o p p
o p p  p
fedorqui
  • Does cut support multiple field delims ? –  Apr 08 '15 at 12:41
  • cant use cut, must use awk – meso_2600 Apr 08 '15 at 12:41
  • @JID you need to pipe to `tr -s ' '` beforehand. – fedorqui Apr 08 '15 at 12:42
  • @meso_2600 are the columns you want to remove just one-space separated? – fedorqui Apr 08 '15 at 12:43
  • @fedorqui must use awk, columns have different width – meso_2600 Apr 08 '15 at 12:44
  • I would perhaps replace the last `+` in the regex with a `*`, so that n fields can be removed from a line with n fields. What should happen if there are fewer fields in the line than should be cut off? – Wintermute Apr 08 '15 at 13:03
  • Yes, if you have to handle removing all fields then change the last `+` to `*`. If you have to handle removing all fields when fewer than the specified number are present then that probably will require a change to the RE too but I can't be bothered to think about it... – Ed Morton Apr 08 '15 at 13:30
3

GNU sed

Remove the first n fields

sed -r 's/([^ ]+ +){2}//' file

GNU awk 4.0+

awk '{sub("([^"FS"]"FS"){2}","")}1' file

GNU awk <4.0

awk --re-interval '{sub("([^"FS"]"FS"){2}","")}1' file

In case the FS version doesn't work (Ed's suggestion)

awk '{sub(/([^ ] ){2}/,"")}1' file

Replace 2 with the number of fields you wish to remove.

EDIT

Another way (doesn't require re-interval)

awk '{for(i=0;i<2;i++)sub($1"[[:space:]]*","")}1' file

Further edit

As advised by Ed Morton, it is bad to use fields in sub as they may contain metacharacters, so here is an alternative (again!)

awk '{for(i=0;i<2;i++)sub(/[^[:space:]]+[[:space:]]*/,"")}1' file

Output

o p
o p p
o p p  p
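
A sketch of the same loop idea with the count passed in as a variable. `drop_fields` is a hypothetical helper name, and `[ \t]` is used in place of `[[:space:]]` on the assumption that only blanks and tabs separate the columns. The `^` anchor keeps each substitution at the start of the line, and no interval expression is involved, so `--re-interval` should not be needed. A line with fewer than n fields simply comes out empty.

```shell
# drop_fields N: remove the first N whitespace-separated fields from each
# line of stdin, leaving the spacing of the remainder untouched.
# (Hypothetical helper name, for illustration only.)
drop_fields() {
    awk -v n="$1" '{
        for (i = 0; i < n; i++)
            sub(/^[ \t]*[^ \t]+[ \t]*/, "")   # strip one leading field and the blanks after it
        print
    }'
}

printf 'a 1  o p\nb  2 o p p\nc     3 o p p  p\n' | drop_fields 2
```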
  • then do it in awk. sed has `s`, awk has `sub()`. – Ed Morton Apr 08 '15 at 12:46
  • @EdMorton cant use just sub, columns have different spacing – meso_2600 Apr 08 '15 at 12:47
  • You are wrong - sub IS the command to use. I'll post the answer. – Ed Morton Apr 08 '15 at 12:48
  • @EdMorton I've posted one ! –  Apr 08 '15 at 12:49
  • Yes, and I see others have too so I've deleted mine. I'd change `"([^"FS"]"FS"){2}"` to `/([^ ] ){2}/` since using FS isn't generally going to work (e.g. if multi-char) anyway and the first arg for sub() is an RE not a string. – Ed Morton Apr 08 '15 at 12:49
  • @EdMorton please undelete yours, I didn't see you around and I am using an answer you posted some time ago to another question (so the credits must go to you!) – fedorqui Apr 08 '15 at 12:51
  • @fedorqui JID had the right approach with sed, just had to be tweaked to awk syntax so either way.... no point having 3 correct answers to choose from! – Ed Morton Apr 08 '15 at 12:52
  • @EdMorton Pretty sure it treats it as an RE and not a string regardless of whether you use `//` or `""` or it always has in my experience. Also i have used this format before with multichar FS and it has worked. Any reason why it wouldn't ? Doesn't it just expand to whatever the FS is ? –  Apr 08 '15 at 12:53
  • @EdMorton Actually i just tried it now and it messed up on some tests, so thanks for the info ! –  Apr 08 '15 at 12:57
  • It definitely wouldn't work with multi-char FS as you can't negate a string in a bracket expression (`[^abc] == !(a||b||c) != !(abc)`). When you use a string in an RE context awk has to do some work to convert the string to an RE before using it. The most obvious consequence of that is you need to escape all escape characters so if you want to write `a\tb` in an RE then you would write `/a\tb/` or `"a\\tb"`. It's a slight gotcha, more to remember and more work for awk - just ALWAYS use RE delimiters around REs unless you need to concatenate strings e.g. `sub(var1".txt","foo")`. – Ed Morton Apr 08 '15 at 13:00
  • It's not that using FS won't work in this particular case, it's just less clear than an explicit blank char, a bit lengthier, a bit misleading as it kinda looks like it'd work for any FS, and doesn't add any useful functionality. – Ed Morton Apr 08 '15 at 13:03
  • @EdMorton Yeah i noticed, might as well leave it though, i have put your suggestion beneath and fedorqui has covered using `[[:space:]]`s :) –  Apr 08 '15 at 13:05
  • (awk 3.1.6) `awk --re-interval '{sub("([^"FS"]"FS"){2}","")}1' a` outputs `o p`, `b p p`, `c p p p`; it didn't remove b and c – meso_2600 Apr 08 '15 at 13:11
  • sed would be fine, but: I can't use it, and if one of the lines contains only 1 column this column would not be removed – meso_2600 Apr 08 '15 at 13:14
  • @JID it fails once there is a tab between the columns – meso_2600 Apr 08 '15 at 13:17
  • 2
    @meso_2600 it's extremely important when posting questions to show an example of your problem that covers the various cases you need to deal with. Edit your question to show a truly representative example of your problem, including the cases that you think might be difficult/unusual to deal with, and provide a better explanation of what might be in your input file. Otherwise we're all just guessing and churning trying to figure out what you want. – Ed Morton Apr 08 '15 at 13:26
  • @JID never do `sub($1"[[:space:]]*","")` unless you have complete control over the input, understand all of the gotchas and have a specific purpose in mind. In this case, imagine what would happen if `$1` contained RE metachars like `.*`. – Ed Morton Apr 08 '15 at 13:28
  • @EdMorton it doesnt include tabs, sorry, just imagined what would happen if it had. so sed works, but none of the awk commands are working for me – meso_2600 Apr 08 '15 at 13:29
  • @meso_2600 what version of awk are you using? Which OS? The posted awk commands will work in gawk, all POSIX awks, and most other awks. The only ones I think might give you trouble by not handling RE intervals are /bin/awk on Solaris and nawk. – Ed Morton Apr 08 '15 at 13:33
  • @EdMorton I was just wondering with regards to RE metacharacters in variables. Would that mean that i could never pass `.*` as a variable to awk and then substitute it ? –  Apr 08 '15 at 13:59
  • 1
    @JID maybe. You could try something like `awk -v var='.*' '{gsub(/./,"[&]",var); sub(var,...)}'` but that doesn't work for all cases, e.g. if `var='.*\\'` you'd get a syntax error. `gsub(/./,"\\\\&",var)` might work better, idk.... I just avoid doing it - if you want to replace an RE use something that operates on REs like sub(), if you want to replace a string then use string functions like index()+substr(). If you find yourself trying to escape/disable all RE metachars then clearly you do NOT want an RE! – Ed Morton Apr 08 '15 at 15:07
  • @EdMorton as stated earlier, I use awk 3.1.6. – meso_2600 Apr 10 '15 at 13:53
  • @EdMorton because couple days earlier you asked: "@meso_2600 what version of awk are you using? Which OS? " – meso_2600 Apr 10 '15 at 15:14
  • Oh, OK. A couple of days ago == ancient history :-). You still didn't tell us the OS though, nor does `awk 3.1.6` tell us if that's BSD awk 3.1.6 or nawk 3.1.6 or gawk 3.1.6 or.... though I strongly suspect it's gawk and if so then you'll need the `--re-interval` flag as shown in the above answer. – Ed Morton Apr 10 '15 at 15:24
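
The advice in the comments above, to use string functions such as index() and substr() when the thing to remove is a literal string rather than an RE, might be sketched as follows. `strip_literal` is a hypothetical helper name. Note that `-v` still processes backslash escapes in the value, so this sketch assumes the string contains no backslashes.

```shell
# strip_literal STR: delete the first literal occurrence of STR from each
# line of stdin. index() does a plain string search, so RE metacharacters
# such as . and * in STR have no special meaning.
# (Hypothetical helper name, for illustration only.)
strip_literal() {
    awk -v s="$1" '{
        pos = index($0, s)
        if (pos > 0)
            $0 = substr($0, 1, pos - 1) substr($0, pos + length(s))
        print
    }'
}

printf 'a.*b axb\n' | strip_literal '.*'
# prints "ab axb"; sub(".*", "") would have emptied the whole line instead
```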
2

In Perl, you can use split with capturing to keep the delimiters:

perl -ne '@f = split /( +)/; print @f[ 1 * 2 .. $#f ]'
#                                      ^
#                                      |
#                              column number goes
#                              here (starting from 0)
choroba
1

If you want to preserve all spaces after the start of the second column, this will do the trick:

{
    # Caution: $1 is used as part of a regex here, so metacharacters in it will misbehave.
    match($0, ($1 "[ \\t]+"))
    print substr($0, RSTART+RLENGTH)
}

The call to match finds the first token together with the whitespace that follows it, setting RSTART to where the match begins and RLENGTH to its length. substr then prints everything on the line after that match.

You could generalize it somewhat to ignore the first N tokens this way:

BEGIN {
    N = 2
}

{
    r = ""
    for (i=1; i<=N; i++) {
        r = (r $i "[ \\t]+")
    }
    match($0, r)
    print substr($0, RSTART+RLENGTH)
}

Applying the above script to your example input yields:

o p
o p p
o p p  p
ReluctantBIOSGuy
  • 2
    lol what is this site.This is just `awk '{for(i=0;i<2;i++)sub($1"[[:space:]]*","")}1'` with about ten more lines of useless junk –  Apr 08 '15 at 13:54
  • 2
    Do NOT do this. It's a disaster waiting to happen. – Ed Morton Apr 08 '15 at 16:41
  • JID: It works much like fedorqui's script, but is somewhat easier for non-regex-ninjas to read. Not sure what part of my solution is useless; I think it is all needed to get the proper answer. – ReluctantBIOSGuy Apr 08 '15 at 21:17
  • Ed: What issue do you see with my solution? – ReluctantBIOSGuy Apr 08 '15 at 21:18
  • Ed: Never mind. I read further down and now I see the potential problem. If $1 contains regex metacharacters, then match won't perform as expected. I tried to vote down my answer, but I can't. :-( – ReluctantBIOSGuy Apr 08 '15 at 21:27
  • @ReluctantBIOSGuy it's not about being more readable, but portability. – meso_2600 Apr 10 '15 at 13:51
  • @JID now I can see your newest solution, didn't see it before – meso_2600 Apr 10 '15 at 13:55
  • @ReluctantBIOSGuy - I just happened to come across your comments that you were directing to me. If you want someone to know when you've left a comment for them you have to put an `@` before their login id like I've done in this one to you, it's not enough to just write `Ed` for example. – Ed Morton Apr 10 '15 at 15:00
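
Following up on the metacharacter problem raised in the comments above, here is a sketch of the match()/substr() approach that builds the pattern from a fixed piece repeated N times instead of from the field contents. `skip_fields` is a hypothetical helper name, and the sketch assumes columns are separated only by blanks and tabs. Lines that do not have N fields followed by whitespace are printed unchanged, which is a design choice of this sketch rather than anything settled in the thread.

```shell
# skip_fields N: print each line of stdin starting at field N+1, keeping the
# original spacing. The regex is assembled from constant pieces only, so the
# data can never be interpreted as part of the pattern.
# (Hypothetical helper name, for illustration only.)
skip_fields() {
    awk -v n="$1" '
        BEGIN {
            re = "^[ \t]*"                    # optional leading blanks
            for (i = 0; i < n; i++)
                re = re "[^ \t]+[ \t]+"       # one field plus its trailing blanks
        }
        {
            if (match($0, re))
                print substr($0, RSTART + RLENGTH)
            else
                print                         # not enough fields: leave the line as is
        }'
}

printf 'a+b 1  o p\nc     3 o p p  p\n' | skip_fields 2
```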