Split data separated by spaces

Question

Suppose I have a dataset with names and registers like

John Wayne 1234
Paul Newman 2345 Wrong register. The correct register is 2233
John Fitzgerald Kennedy 3456
Marilyn Monroe 1212

All lines are space separated. I want one (or two) regular expressions to use in awk that give me the following outputs:

John Wayne
Paul Newman
John Fitzgerald Kennedy
Marilyn Monroe

and

1234
2233
3456
1212

I know the data is very badly formatted, but does anyone know how to do this?

Do you want one awk script that generates both outputs, or is it acceptable to have two commands/scripts that generate the two outputs separately? – Kent, Oct 22 '15 at 20:16

score 2 · Accepted Answer · edited Oct 22 '15 at 21:21

grep can generate the two outputs separately. See the test below:

$  cat f
John Wayne 1234
Paul Newman 2345 Wrong register. The correct register is 2233
John Fitzgerald Kennedy 3456
Marilyn Monroe 1212

Output I:

$  grep -o '^[^0-9]\+' f                                          
John Wayne 
Paul Newman 
John Fitzgerald Kennedy 
Marilyn Monroe

Output II:

$  grep -o '[0-9]\+$' f 
1234
2233
3456
1212

The regexes used above are relatively straightforward. Using the same idea, you could apply them with sed or awk too, if you like.
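Since the question asks for awk specifically, here is a minimal sketch of the same two patterns carried over to awk. It is an illustration under stated assumptions, not part of the original answer: the variable names `data`, `names` and `registers` are hypothetical, and the first pattern also consumes the space before the first digit so that no trailing space is left behind.

```shell
# Sample data copied from the question
data='John Wayne 1234
Paul Newman 2345 Wrong register. The correct register is 2233
John Fitzgerald Kennedy 3456
Marilyn Monroe 1212'

# Output I: delete from the space(s) before the first digit to the end of the line
names=$(printf '%s\n' "$data" | awk '{ sub(/ *[0-9].*$/, ""); print }')

# Output II: print only the run of digits that ends the line
registers=$(printf '%s\n' "$data" |
    awk 'match($0, /[0-9]+$/) { print substr($0, RSTART, RLENGTH) }')

printf '%s\n' "$names" "" "$registers"
```

Lines that do not end in digits are silently skipped by the second command, which may or may not be what you want for messier input.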

answered Oct 22 '15 at 20:21 by Kent · edited Oct 22 '15 at 21:21 by mklement0

Nice, but the 1st `grep` command leaves a trailing space on each output line. – mklement0 Oct 22 '15 at 21:22

Yaron · Answer 2 · 2015-10-22T21:56:01.347

This case is fairly simple: the register is always the last space-separated field on the line, so we can print the last column without caring what comes before it:

awk '{print $NF}'

For the names, we match the leading run of letters and spaces, which stops at the first non-letter character (such as a digit), and replace the rest of the line with nothing:

sed 's/\([A-Za-z ]*\) .*/\1/g'
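To try both commands against the sample data from the question, a sketch along these lines should work. Two things here are assumptions rather than part of the answer: the character class is spelled `[A-Za-z ]`, because in ASCII the range `A-z` also covers a few punctuation characters, and the variable names are only for illustration.

```shell
# Sample data copied from the question
data='John Wayne 1234
Paul Newman 2345 Wrong register. The correct register is 2233
John Fitzgerald Kennedy 3456
Marilyn Monroe 1212'

# Registers: the last whitespace-separated field of every line
registers=$(printf '%s\n' "$data" | awk '{ print $NF }')

# Names: keep the leading run of letters and spaces, drop the rest
names=$(printf '%s\n' "$data" | sed 's/\([A-Za-z ]*\) .*/\1/')

printf '%s\n' "$names" "" "$registers"
```

Note that the sed command assumes every line contains a non-letter part; on a line consisting only of a name it would cut off the last word.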

answered Oct 22 '15 at 20:17 by Yaron · edited Oct 22 '15 at 21:56

Kudos for a clever, pragmatic `awk` solution, but your `sed` command leaves a trailing space; you could append a 2nd command inside the `sed` script to remedy: `; s/ $//` – mklement0 Oct 22 '15 at 21:16

Yes, great (++) - better than my suggestion. – mklement0 Oct 22 '15 at 22:00

score 1 · Answer 3 · answered Oct 22 '15 at 20:21

You can use sed:

sed 's/[[:blank:]]*[[:digit:]]\+.*$//' file
John Wayne
Paul Newman
John Fitzgerald Kennedy
Marilyn Monroe

sed 's/.*[[:blank:]]\([[:digit:]]\+\)$/\1/' file
1234
2233
3456
1212
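One portability note: `\+` in a basic regular expression is a GNU extension, so the commands above may not behave the same in every sed. A sketch of an equivalent spelling that sticks to POSIX basic regular expressions follows; the variable names are hypothetical and the sample data is inlined so the block can be pasted as is.

```shell
# Sample data copied from the question
data='John Wayne 1234
Paul Newman 2345 Wrong register. The correct register is 2233
John Fitzgerald Kennedy 3456
Marilyn Monroe 1212'

# Names: one digit is enough to anchor the cut, since .* removes the rest
names=$(printf '%s\n' "$data" | sed 's/[[:blank:]]*[[:digit:]].*$//')

# Registers: "one or more digits" written as digit followed by digit*
registers=$(printf '%s\n' "$data" |
    sed 's/.*[[:blank:]]\([[:digit:]][[:digit:]]*\)$/\1/')

printf '%s\n' "$names" "" "$registers"
```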

answered Oct 22 '15 at 20:21 by anubhava

Andrew · Answer 4 · 2015-10-22T21:37:36.267

Late to the party, but this lets you do both jobs at once:

#!/usr/bin/awk -f

{
    nums = nums "\n" $NF
    split($0, a, " [0-9]{4}")
    names = names a[1] "\n"
}

END {
    print names nums
}

First, it takes the last field of the line and appends it to a list of numbers. Then it splits the line on a space followed by a 4-digit number and appends the part before the first split to a list of names. Finally, it prints the list of names followed by the list of numbers.

Output:

John Wayne
Paul Newman
John Fitzgerald Kennedy
Marilyn Monroe

1234
2233
3456
1212

If extraneous spaces are a concern, pipe the output to cat -e, which marks the end of each line with $, so any trailing whitespace becomes visible.
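For quick tests in a terminal, the same approach can also be written as a command with a multi-line program string. The sketch below is a variation, not the script above verbatim: it stores names and numbers in arrays instead of concatenated strings, and it spells out the four digits rather than using `{4}`, on the assumption that some awk implementations do not support interval expressions. The names `data`, `part` and `result` are hypothetical.

```shell
# Sample data copied from the question
data='John Wayne 1234
Paul Newman 2345 Wrong register. The correct register is 2233
John Fitzgerald Kennedy 3456
Marilyn Monroe 1212'

result=$(printf '%s\n' "$data" | awk '
{
    nums[NR] = $NF                             # last field: the register
    split($0, part, " [0-9][0-9][0-9][0-9]")   # cut at space + four digits
    names[NR] = part[1]                        # text before the first cut
}
END {
    for (i = 1; i <= NR; i++) print names[i]
    print ""                                   # blank separator line
    for (i = 1; i <= NR; i++) print nums[i]
}')

printf '%s\n' "$result"
```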

answered Oct 22 '15 at 21:06 by Andrew · edited Oct 22 '15 at 21:37

Nicely done, but you should use `" [0-9]{4}"` (note the leading space) to eliminate a trailing space after the names. Also, `{ print names nums }` (no comma) will avoid a single space on the separator line. Perhaps you can reformat the `awk` command to be multi-line for readability, and provide sample output. – mklement0 Oct 22 '15 at 21:26
Good suggestions. Thanks. – Andrew Oct 22 '15 at 21:35
Thanks for updating, ++; note that POSIX-like shells such as `bash` do support multi-line string literals, so retaining the _CLI_ form of the solution _combined with a multi-line string_ offers the best of both worlds: readability, while still being able to paste the command into a terminal for quick tests; see [here](http://stackoverflow.com/a/33271539/45375) for an example. – mklement0 Oct 22 '15 at 21:44

score 0 · Answer 5 · edited Oct 22 '15 at 22:02

awk lets you specify a regular expression, such as a character class, as the field separator. Therefore, if you know that your names are always followed by numbers, you can use:

awk -F"[0-9]" '{print $1}' /tmp/x
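The field-separator idea can be stretched to cover both outputs. In the sketch below, the first command puts a space in front of the digit class so the name comes out without a trailing space; the second command, which treats every run of non-digits as the separator and prints the last field, is an assumption about how to get the registers with the same technique and is not part of the original answer. The variable names are hypothetical.

```shell
# Sample data copied from the question
data='John Wayne 1234
Paul Newman 2345 Wrong register. The correct register is 2233
John Fitzgerald Kennedy 3456
Marilyn Monroe 1212'

# Names: fields are separated by "space followed by a digit"
names=$(printf '%s\n' "$data" | awk -F' [0-9]' '{ print $1 }')

# Registers: any run of non-digits separates fields; the last field wins
registers=$(printf '%s\n' "$data" | awk -F'[^0-9]+' '{ print $NF }')

printf '%s\n' "$names" "" "$registers"
```

The second command only works when the line ends in digits; a line ending in other text would yield an empty last field.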

answered Oct 22 '15 at 20:21 by Keith Hanlan · edited Oct 22 '15 at 22:02 by mklement0

Nice, but you're only answering half the question (you're extracting the names only, not the numbers). `-F" [0-9]"` (note the actual space character before the bracket) would eliminate the trailing spaces from the output. – mklement0 Oct 22 '15 at 22:02
