How to preserve the original whitespace between fields in awk?

Question

When processing input with awk, sometimes I want to edit one of the fields, without touching anything else. Consider this:

$ ls -l | awk 1
total 88
-rw-r--r-- 1 jack jack     8 Jun 19  2013 qunit-1.11.0.css
-rw-r--r-- 1 jack jack 56908 Jun 19  2013 qunit-1.11.0.js
-rw-r--r-- 1 jack jack  4306 Dec 29 09:16 test1.html
-rw-r--r-- 1 jack jack  5476 Dec  7 08:09 test1.js

If I don't edit any of the fields ($1, $2, ...), everything is preserved as it was. But suppose I want to keep only the first 3 characters of the first field:

$ ls -l | awk '{$1 = substr($1, 1, 3) } 1'
tot 88
-rw 1 jack jack 8 Jun 19 2013 qunit-1.11.0.css
-rw 1 jack jack 56908 Jun 19 2013 qunit-1.11.0.js
-rw 1 jack jack 4306 Dec 29 09:16 test1.html
-rw 1 jack jack 5476 Dec 7 08:09 test1.js

The original whitespace between all fields is replaced with a single space.

Is there a way to preserve the original whitespace between the fields?

UPDATE

In this sample, it's relatively easy to edit the first 4 fields. But what if I want to keep only the first letter of $6 (the month) in order to get this output:

-rw-r--r-- 1 jack jack     8 J 19  2013 qunit-1.11.0.css
-rw-r--r-- 1 jack jack 56908 J 19  2013 qunit-1.11.0.js
-rw-r--r-- 1 jack jack  4306 D 29 09:16 test1.html
-rw-r--r-- 1 jack jack  5476 D  7 08:09 test1.js

score 18 · Answer 1 · answered Dec 30 '13 at 09:38

If you want to preserve the whitespace, you could also try the split function. In GNU Awk version 4, split accepts 4 arguments; the last one is an array that receives the separators found between the fields. For instance:

echo "a  2   4  6" | gawk ' {
 n=split($0,a," ",b)
 a[3]=7
 line=b[0]
 for (i=1;i<=n; i++)
     line=(line a[i] b[i])
 print line
}'

gives output

a  2   7  6

Håkon Hægland

This is **THE** right answer and is the main reason why the 4th arg to `split()` was introduced. Anything else gets very complicated in the general case where an FS can be any regexp, not just the default white space or anything else you can simply negate in a bracket expression. – Ed Morton Jul 17 '16 at 15:12
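To illustrate the point about arbitrary separators, here is a minimal sketch that wraps the 4-argument `split()` in a shell function. The function name `set_field` and its three parameters (field number, new value, separator regexp) are made up for this example, and it assumes GNU awk 4 or later is installed as `gawk`.

```shell
# Hypothetical helper (not from the answer): replace field N with VALUE,
# keeping every separator exactly as it appeared in the input.
# Usage: set_field N VALUE SEPARATOR_REGEXP < input
set_field() {
    gawk -v n="$1" -v val="$2" -v sepre="$3" '{
        # fld[] receives the fields, sep[] the text matched between them
        nf = split($0, fld, sepre, sep)
        if (n <= nf) fld[n] = val
        # sep[0] is leading whitespace; it is only set when sepre is " "
        line = sep[0]
        for (i = 1; i <= nf; i++) line = line fld[i] sep[i]
        print line
    }'
}

# Example calls:
#   printf 'a  2   4  6\n' | set_field 3 7 ' '
#   printf 'a , b,  c\n'   | set_field 2 X ' *, *'
```

Passing a single space as the separator gives the default whitespace splitting; any longer string is treated as a regular expression.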

score 10 · Accepted Answer · edited May 23 '17 at 10:31

I know this is an old question but I thought there had to be something better. This answer is for those that stumbled onto this question while searching. While looking around on the web, I have to say @Håkon Hægland has the best answer and that is what I used at first.

But here is my solution: use FPAT (a GNU awk feature). It holds a regular expression that describes what a field looks like, rather than what separates the fields.

 FPAT = "([[:space:]]*[[:alnum:][:punct:][:digit:]]+)";

In this case, I am saying that a field starts with zero or more whitespace characters and ends with one or more non-whitespace characters. Here is a link if you are having trouble understanding POSIX bracket expressions.

Also, set the output field separator to the empty string with OFS = ""; because once a field has been modified, awk rebuilds the line and inserts OFS between the fields. With the default OFS, that adds an extra space before every field.

I used the same example to test.

$ cat example-output.txt
-rw-r--r-- 1 jack jack     8 Jun 19  2013 qunit-1.11.0.css
-rw-r--r-- 1 jack jack 56908 Jun 19  2013 qunit-1.11.0.js
-rw-r--r-- 1 jack jack  4306 Dec 29 09:16 test1.html
-rw-r--r-- 1 jack jack  5476 Dec  7 08:09 test1.js

$ awk 'BEGIN { FPAT = "([[:space:]]*[[:alnum:][:punct:][:digit:]]+)"; OFS = ""; } { $6 = substr( $6, 1, 2);  print $0; }' example-output.txt
-rw-r--r-- 1 jack jack     8 J 19  2013 qunit-1.11.0.css
-rw-r--r-- 1 jack jack 56908 J 19  2013 qunit-1.11.0.js
-rw-r--r-- 1 jack jack  4306 D 29 09:16 test1.html
-rw-r--r-- 1 jack jack  5476 D  7 08:09 test1.js

Keep in mind that the fields now include their leading spaces. So if a field needs to be replaced by something else, pad the replacement to the original field width:

len = length($1); 
$1 = sprintf("%"(len)"s", "-42-");

$ awk 'BEGIN { FPAT = "([[:space:]]*[[:alnum:][:punct:][:digit:]]+)"; OFS = ""; } { if(NR==1){ len = length($1); $1 = sprintf("%"(len)"s", "-42-"); } print $0; }' example-output.txt
      -42- 1 jack jack     8 Jun 19  2013 qunit-1.11.0.css
-rw-r--r-- 1 jack jack 56908 Jun 19  2013 qunit-1.11.0.js
-rw-r--r-- 1 jack jack  4306 Dec 29 09:16 test1.html
-rw-r--r-- 1 jack jack  5476 Dec  7 08:09 test1.js

You could replace `[[:alnum:][:punct:][:digit:]]` with `[^[:space:]]` and in addition to being briefer the solution will be more robust. idk what that stuff with `-42-` is all about but if you're just trying to show SOMETHING in a field width it'd be written as `$1 = sprintf("%*s", len, "-42-")`, not `$1 = sprintf("%"(len)"s", "-42-")`. Obviously this whole solution falls apart when using other than the default FS so [@Hakon's solution](http://stackoverflow.com/a/20836890/1745001) is preferred. — Ed Morton, Jul 17 '16 at 15:06
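Putting the two suggestions from this comment together, the FPAT approach might look like the sketch below. The function name `pad_replace` and its parameters are invented for illustration, and it assumes GNU awk (`gawk`), since FPAT is a gawk extension.

```shell
# Hypothetical helper (not from the answer): overwrite field N with VALUE,
# right-aligned in the width the field originally occupied.
# Usage: pad_replace N VALUE < input
pad_replace() {
    gawk -v n="$1" -v val="$2" '
    BEGIN {
        # A field is optional leading whitespace followed by non-whitespace
        FPAT = "[[:space:]]*[^[:space:]]+"
        # Fields carry their own leading whitespace, so join them with nothing
        OFS = ""
    }
    {
        # %*s takes the width from the argument list
        if (n <= NF) $n = sprintf("%*s", length($n), val)
        print
    }'
}

# Example call:
#   pad_replace 1 -42- < example-output.txt
```

If the replacement is longer than the original field (leading whitespace included), the columns to its right shift, and whitespace after the last field is not part of any field, so it is dropped when the line is rebuilt.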

score 6 · Answer 3 · answered Apr 21 '18 at 21:21

The simplest solution is to make sure that field splitting happens on every single space. You do that by setting the field separator to [ ]:

$ awk -F '[ ]' '{$1=substr($1,1,3)}1' infile

-rw 1 jack jack     8 Jun 19  2013 qunit-1.11.0.css
-rw 1 jack jack 56908 Jun 19  2013 qunit-1.11.0.js
-rw 1 jack jack  4306 Dec 29 09:16 test1.html
-rw 1 jack jack  5476 Dec  7 08:09 test1.js

By default, awk splits on any run of whitespace (tabs and spaces, something similar to [ \t]+). The manual states:

In the special case that FS is a single space, fields are separated by runs of spaces and/or tabs and/or newlines.

That collapses each run of spaces, tabs and newlines into a single OFS in the output. If OFS is a space (the default), only one space is printed for each run of whitespace.

But awk could be told to select only one space as a field delimiter using a regular expression that will match only one character: [ ].

Note that this changes the field numbering: each space starts a new field, so a run of spaces produces empty fields. Look at this result from the data you presented:

$ awk -F '[ ]' '{print($4,$5,$6)}' infile
jack
jack 56908 Jun
jack  4306
jack  5476

In this specific case, there are no spaces before the first field and only one space after it. That is why editing $1 works correctly.
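For fields further to the right, one way around the shifted numbering is to count only the non-empty fields. The sketch below is an illustration rather than part of the original answer: the function name `trim_nth_field` and its parameters are made up, and it assumes the columns are separated by spaces only (no tabs).

```shell
# Hypothetical helper: keep the first LEN characters of the N-th
# non-empty field while splitting on every single space.
# Usage: trim_nth_field N LEN < input
trim_nth_field() {
    awk -F '[ ]' -v n="$1" -v len="$2" '{
        seen = 0
        for (i = 1; i <= NF; i++) {
            # Empty fields are the gaps inside a run of spaces; skip them
            if ($i != "" && ++seen == n) {
                $i = substr($i, 1, len)
                break
            }
        }
        # Rebuilding $0 puts one OFS (a space) back for each space removed
        print
    }'
}

# Example call, keeping the first letter of the month:
#   trim_nth_field 6 1 < infile
```

Because each separator is exactly one space and OFS is also one space, the rebuilt line has the same spacing as the input.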

score 5 · Answer 4 · edited May 23 '17 at 12:26

It's possible to preserve the original whitespace by editing $0 instead of individual fields ($1, $2, ...), for example:

$ ls -l | awk '{$0 = substr($1, 1, 3) substr($0, length($1) + 1)} 1'
tot 88
-rw 1 jack jack     8 Jun 19  2013 qunit-1.11.0.css
-rw 1 jack jack 56908 Jun 19  2013 qunit-1.11.0.js
-rw 1 jack jack  4306 Dec 29 09:16 test1.html
-rw 1 jack jack  5476 Dec  7 08:09 test1.js

This is relatively easy to do when editing the first column, but gets troublesome when editing others ($2, ..., $4), and breaks down after fields where the width of the whitespace in between is not fixed ($5 and beyond in this example).
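For columns with variable-width gaps, a portable way to edit $0 is to walk the line with `match()` and copy the whitespace through untouched. This is only a sketch: the function name `edit_nth_in_place` and its parameters are invented here, and it assumes fields are separated by spaces or tabs.

```shell
# Hypothetical helper: keep the first LEN characters of field N by
# rebuilding the line piece by piece, without assigning to any $i.
# Usage: edit_nth_in_place N LEN < input
edit_nth_in_place() {
    awk -v n="$1" -v len="$2" '{
        rest = $0
        out = ""
        for (i = 1; i <= n; i++) {
            # Find the next run of non-blank characters in what is left
            if (!match(rest, /[^ \t]+/)) break    # fewer than n fields
            lead  = substr(rest, 1, RSTART - 1)   # whitespace before it
            field = substr(rest, RSTART, RLENGTH)
            rest  = substr(rest, RSTART + RLENGTH)
            if (i == n) field = substr(field, 1, len)
            out = out lead field
        }
        print out rest
    }'
}

# Example call, keeping the first letter of the month:
#   ls -l | edit_nth_in_place 6 1
```

Lines with fewer than N fields are printed unchanged.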

UPDATE

Based on @Håkon Hægland's answer, here's a way to keep the first 2 characters of the 6th field (the month):

{
    n = split($0, f, " ", sep)
    f[6] = substr(f[6], 1, 2)
    line = sep[0]
    for (i = 1; i <= n; ++i) line = line f[i] sep[i]
    print line
}

With GNU awk, I suggest something like `if (match($0, "^([^ \t]+)[ \t]+([^ \t]+)[ \t]+([^ \t]+)", fields)) { … }` to find out where the fields are. Then you can use `fields[2, "start"]` and `fields[2, "start"] + fields[2, "length"] - 1`, for example, to get the indexes of the start and end of the second field. — 200_success, Aug 21 '15 at 19:28
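As a rough sketch of that suggestion, the function below uses the positions stored by the 3-argument `match()` to edit the second field in place. The function name `upcase_second` and the choice of upper-casing as the edit are made up for the example, and it assumes GNU awk (`gawk`).

```shell
# Hypothetical helper: upper-case the second field and leave
# everything else, including the spacing, exactly as it was.
upcase_second() {
    gawk '{
        if (match($0, /^([^ \t]+)[ \t]+([^ \t]+)[ \t]+([^ \t]+)/, m)) {
            s = m[2, "start"]     # index in $0 where group 2 begins
            l = m[2, "length"]    # number of characters in group 2
            print substr($0, 1, s - 1) toupper(substr($0, s, l)) substr($0, s + l)
        } else {
            print                 # fewer than three fields: pass through
        }
    }'
}

# Example call:
#   printf 'ab   cd  ef   gh\n' | upcase_second
```

The regexp needs one parenthesised group per field up to the one being edited, so this gets unwieldy for fields far to the right.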
