
I'm trying to extract a certain field (the fourth) from a column-based, space-padded text stream. I'm trying to use the cut command in the following manner:

cat text.txt | cut -d " " -f 4

Unfortunately, cut doesn't treat several spaces as one delimiter. I could have piped through awk

awk '{ print $4 }'

or sed

sed -E "s/[[:space:]]+/ /g"

to collapse the spaces, but I'd like to know whether there is any way to handle several delimiters natively with cut?

mbaitoff
    AWK is the way to go. – Dennis Williamson Nov 10 '10 at 15:10
  • Possible duplicate of [linux cut help - how to specify more spaces for the delimiter?](http://stackoverflow.com/questions/7142735/linux-cut-help-how-to-specify-more-spaces-for-the-delimiter) – Inanc Gumus Jan 13 '17 at 18:53
  • I love `awk` BUT when you are doing `kubectl ... bash -c 'awk ...'` and similar, things start to get funny with quotes, parameter references, etc. Then it's actually quite nice to whip out the old rudimentary tools from the toolbox. – sastorsl Apr 29 '22 at 06:51

6 Answers


Try:

tr -s ' ' <text.txt | cut -d ' ' -f4

From the tr man page:

-s, --squeeze-repeats   replace each input sequence of a repeated character
                        that is listed in SET1 with a single occurrence
                        of that character
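A minimal sketch of the squeeze step in action (sample input made up here; both GNU and BSD tr support -s):

```shell
# tr -s ' ' collapses each run of spaces to a single space,
# so cut's single-character delimiter then works as expected
printf 'a   b    c  d\n' | tr -s ' ' | cut -d ' ' -f 4
# prints: d
```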
kev
    No need for `cat` here. You could pass `< text.txt` directly to `tr`. http://en.wikipedia.org/wiki/Cat_%28Unix%29#Useless_use_of_cat – arielf Aug 09 '14 at 20:10
    Not sure it is any simpler, but you are going to merge, you can forgo cut's `-d` and translate straight from multiple characters to tab. For example: I came here looking for a way to automatically export my display: `who am i | tr -s ' ()' '\t' | cut -f5` – Leo Mar 28 '16 at 23:24
  • This doesn't remove leading/trailing whitespace (which may or may not be wanted, but usually isn't), in contrast with the awk solution. The awk solution is also much more readable and less verbose. – n.caillou Apr 04 '18 at 23:31
  • -1 **WARNING: THIS IS NOT THE SAME THING AS TREATING SEQUENTIAL DELIMETERS AS ONE.** Compare `echo "a b c" | cut -d " " -f2-`, `echo "a b c" | tr -s " " | cut -d " " -f2-` – user541686 Jul 21 '19 at 10:01
    @user541686 Yes it is. Your example demonstrates exactly this. To see, try changing `-f2-` to `-f3-`. This shows that in the cut-only approach, there are 4 fields: 'a', 'b', '', and 'c', whereas in the tr-cut approach there are only 3: 'a', 'b', and 'c'. – ibonyun Oct 04 '22 at 21:08
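To illustrate the leading-whitespace caveat raised in the comments, here is a small sketch with made-up input:

```shell
# tr -s collapses runs of spaces but keeps a single leading space,
# which cut then sees as an empty first field
printf '   a b c\n' | tr -s ' ' | cut -d ' ' -f 2
# prints: a   (field 1 is empty because the line starts with the delimiter)

# awk strips leading whitespace before splitting, so fields line up as expected
printf '   a b c\n' | awk '{print $2}'
# prints: b
```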

As you note in your question, awk is really the way to go. Using cut together with tr -s to squeeze the spaces is also possible, as kev's answer shows.

Let me however go through all the possible combinations for future readers. Explanations are at the Test section.

tr | cut

tr -s ' ' < file | cut -d' ' -f4

awk

awk '{print $4}' file

bash

while read -r _ _ _ myfield _
do
   echo "fourth field: $myfield"
done < file

sed

sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' file

Tests

Given this file, let's test the commands:

$ cat a
this   is    line     1 more text
this      is line    2     more text
this    is line 3     more text
this is   line 4            more    text

tr | cut

$ cut -d' ' -f4 a
is
                        # it does not show what we want!


$ tr -s ' ' < a | cut -d' ' -f4
1
2                       # this makes it!
3
4
$

awk

$ awk '{print $4}' a
1
2
3
4

bash

This reads the fields sequentially. The _ is a throwaway ("junk") variable used to ignore the fields we don't need. This way, $myfield holds the 4th field of each line, no matter how many spaces separate the fields.

$ while read -r _ _ _ a _; do echo "4th field: $a"; done < a
4th field: 1
4th field: 2
4th field: 3
4th field: 4
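The same idea as a one-liner sketch (hypothetical sample input), showing that read splits on runs of IFS whitespace and that the last variable absorbs the rest of the line:

```shell
# The trailing _ soaks up everything after the 4th field,
# so "extra words" is simply discarded
printf 'this  is   line 5   extra words\n' | { read -r _ _ _ f _; echo "$f"; }
# prints: 5
```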

sed

This matches three groups of non-spaces followed by spaces with ([^ ]*[ ]*){3}. Then it captures everything up to the next space as the 4th field, which is finally printed with \2.

$ sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' a
1
2
3
4
fedorqui
    `awk` is not only elegant and simple, it is also included in VMware ESXi, where `tr` is missing. – user121391 May 10 '16 at 09:19
    @user121391 yet another reason to use `awk`! – fedorqui May 10 '16 at 09:29
  • @fedorqui I've never heard of the underscore as "junk variable". Can you provide any more insight/reference on this? – BryKKan Nov 14 '17 at 16:01
    @BryKKan I learnt about it in Greg's [How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?](http://mywiki.wooledge.org/BashFAQ/001): _Some people use the throwaway variable _ as a "junk variable" to ignore fields. It (or indeed any variable) can also be used more than once in a single `read` command, if we don't care what goes into it_. It can be anything, it is just that it somehow became standard instead of `junk_var` or `whatever` :) – fedorqui Nov 15 '17 at 07:37
  • @BryKKan In Javascript it also represents a function parameter that is not meant to be used. – Adrian Sep 28 '21 at 14:55

shortest/friendliest solution

After becoming frustrated with the many limitations of cut, I wrote my own replacement, which I called cuts for "cut on steroids".

cuts provides what is likely the most minimalist solution to this and many other related cut/paste problems.

One example, out of many, addressing this particular question:

$ cat text.txt
0   1        2 3
0 1          2   3 4

$ cuts 2 text.txt
2
2

cuts supports:

  • auto-detection of most common field-delimiters in files (+ ability to override defaults)
  • multi-char, mixed-char, and regex matched delimiters
  • extracting columns from multiple files with mixed delimiters
  • offsets from end of line (using negative numbers) in addition to start of line
  • automatic side-by-side pasting of columns (no need to invoke paste separately)
  • support for field reordering
  • a config file where users can change their personal preferences
  • great emphasis on user friendliness & minimalist required typing

and much more. None of which is provided by standard cut.

See also: https://stackoverflow.com/a/24543231/1296044

Source and documentation (free software): http://arielf.github.io/cuts/

arielf

This Perl one-liner shows how closely Perl is related to awk:

perl -lane 'print $F[3]' text.txt

However, note that the @F autosplit array starts at index $F[0], while awk fields start with $1.
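A quick side-by-side sketch (sample input made up here), assuming both perl and awk are installed:

```shell
# perl switches: -l appends the newline, -a autosplits each line into @F,
# -n loops over input lines, -e runs the given code
printf 'w x y z\n' | perl -lane 'print $F[3]'
# prints: z
printf 'w x y z\n' | awk '{print $4}'
# prints: z
```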

Chris Koknat

With versions of cut I know of, no, this is not possible. cut is primarily useful for parsing files where the separator is not whitespace (for example /etc/passwd) and that have a fixed number of fields. Two separators in a row mean an empty field, and that goes for whitespace too.
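A small sketch illustrating the empty-field behavior with a non-whitespace delimiter (made-up input):

```shell
# Two adjacent delimiters delimit an empty field:
printf 'a::b\n' | cut -d':' -f2   # prints an empty line (field 2 is empty)
printf 'a::b\n' | cut -d':' -f3   # prints: b
```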

Benoit

I've created a patch that adds a new -m command-line option to cut; it works in field mode and treats multiple consecutive delimiters as a single delimiter. This solves the OP's question in a rather efficient way. I also submitted the patch upstream a couple of days ago; let's hope it gets merged into the coreutils project.

There are some further thoughts about adding even more whitespace-related features to cut, and feedback on all of that would be great. I'm willing to implement more patches for cut and submit them upstream, which would make this utility more versatile and usable in various real-world scenarios.

dsimic