2

Suppose I have a dataset with names and registers like

John Wayne 1234
Paul Newman 2345 Wrong register. The correct register is 2233
John Fitzgerald Kennedy 3456
Marilyn Monroe 1212

The fields on every line are separated by spaces. I want one (or two) regular expressions to use in awk that produce the following outputs:

John Wayne
Paul Newman
John Fitzgerald Kennedy
Marilyn Monroe

and

1234
2233
3456
1212

I know the data is very badly formatted, but does anyone know how to do this?

Marcus Nunes

5 Answers

2

grep can generate the two outputs separately. See the test below:

$  cat f
John Wayne 1234
Paul Newman 2345 Wrong register. The correct register is 2233
John Fitzgerald Kennedy 3456
Marilyn Monroe 1212

Output I:

$  grep -o '^[^0-9]\+' f
John Wayne 
Paul Newman 
John Fitzgerald Kennedy 
Marilyn Monroe

Output II:

$  grep -o '[0-9]\+$' f
1234
2233
3456
1212

The regexes used above are relatively straightforward. Using the same idea, you could apply them with sed or awk too, if you like.
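Since the question asks for awk, the same two patterns could be carried over roughly as follows. This is only a sketch: the function names `extract_names` and `extract_registers` are hypothetical, and the name pattern is assumed to also swallow the blanks before the first digit so that no trailing space is left behind.

```shell
# Hypothetical helpers; each one reads the dataset on standard input.

# Names: delete everything from the blanks before the first digit onward.
extract_names() {
  awk '{ sub(/ *[0-9].*$/, ""); print }'
}

# Registers: print the run of digits that ends the line, if there is one.
extract_registers() {
  awk 'match($0, /[0-9]+$/) { print substr($0, RSTART, RLENGTH) }'
}

# Usage on one of the sample lines:
line='Paul Newman 2345 Wrong register. The correct register is 2233'
printf '%s\n' "$line" | extract_names      # prints: Paul Newman
printf '%s\n' "$line" | extract_registers  # prints: 2233
```

Both helpers stick to POSIX awk functions (`sub`, `match`, `substr`), so they should not depend on a particular awk implementation.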

mklement0
Kent
1

The numbers are the simple part: each one comes after the last separator, so we can print the last field without caring what it contains:

awk '{print $NF}'

For the names, we match letters and spaces up to the first non-letter character (such as a digit) and delete the rest of the line. Note that the range is written as A-Za-z, because A-z also covers the punctuation characters that sit between Z and a in ASCII:

sed 's/\([A-Za-z ]*\) .*/\1/g'
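Applied to one of the sample lines, the two commands might be wrapped like this. The function names are made up for illustration, and the sed pattern here uses the POSIX class `[[:alpha:]]` for letters, which is an assumption about what counts as a name character rather than part of the original command.

```shell
# Hypothetical wrappers; both read lines from standard input.

# Register: the last whitespace-separated field of each line.
last_field() {
  awk '{ print $NF }'
}

# Name: the leading run of letters and spaces. The greedy group backs off
# by one space so that the literal space in the pattern can match, which
# keeps the trailing space out of the captured name.
leading_name() {
  sed 's/\([[:alpha:] ]*\) .*/\1/'
}

# Usage on one of the sample lines:
line='Paul Newman 2345 Wrong register. The correct register is 2233'
printf '%s\n' "$line" | last_field     # prints: 2233
printf '%s\n' "$line" | leading_name   # prints: Paul Newman
```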
Yaron
  • Kudos for a clever, pragmatic `awk` solution, but your `sed` command leaves a trailing space; you could append a 2nd command inside the `sed` script to remedy: `; s/ $//` – mklement0 Oct 22 '15 at 21:16
  • Yes, great (++) - better than my suggestion. – mklement0 Oct 22 '15 at 22:00
1

You can use sed:

sed 's/[[:blank:]]*[[:digit:]]\+.*$//' file
John Wayne
Paul Newman
John Fitzgerald Kennedy
Marilyn Monroe

sed 's/.*[[:blank:]]\([[:digit:]]\+\)$/\1/' file
1234
2233
3456
1212
anubhava
1

Late to the party, but this lets you do both jobs at once:

#!/usr/bin/awk -f

{
    nums = nums "\n" $NF
    split($0, a, " [0-9]{4}")
    names = names a[1] "\n"
}

END {
    print names nums
}

First, it takes the last field of the line and adds it to a list of numbers. Then it splits the line on a space followed by a 4-digit number and adds the part before the first split to a list of names. Finally, it prints the list of names followed by the list of numbers.

Output:

John Wayne
Paul Newman
John Fitzgerald Kennedy
Marilyn Monroe

1234
2233
3456
1212

If extraneous spaces are of concern, pipe to cat -e to make it very clear where whitespace may have occurred.
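For quick tests in a terminal, the same logic could also be kept in command-line form with a multi-line string, roughly as sketched here. The function name `split_names_and_registers` is hypothetical, and the four-digit separator is spelled out as `[0-9][0-9][0-9][0-9]` on the assumption that some awk builds do not enable `{4}` interval expressions by default.

```shell
# Hypothetical wrapper: reads the dataset on standard input and prints all
# names, then an empty line, then all registers.
split_names_and_registers() {
  awk '
    {
      nums = nums "\n" $NF                          # last field is the register
      split($0, parts, / [0-9][0-9][0-9][0-9]/)     # cut at a space plus four digits
      names = names parts[1] "\n"                   # text before the first cut
    }
    END { print names nums }
  '
}

# Usage with two of the sample lines:
printf '%s\n' \
  'John Wayne 1234' \
  'Paul Newman 2345 Wrong register. The correct register is 2233' |
  split_names_and_registers
```

Because each name is followed by a newline and each register is preceded by one, the two lists end up separated by exactly one empty line.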

Andrew
  • Nicely done, but you should use `" [0-9]{4}"` (note the leading space) to eliminate a trailing space after the names. Also, `{ print names nums }` (no comma) will avoid a single space on the separator line. Perhaps you can reformat the `awk` command to be multi-line for readability, and provide sample output. – mklement0 Oct 22 '15 at 21:26
  • Good suggestions. Thanks. – Andrew Oct 22 '15 at 21:35
  • Thanks for updating, ++; note that POSIX-like shells such as `bash` do support multi-line string literals, so retaining the _CLI_ form of the solution _combined with a multi-line string_ offers the best of both worlds: readability, while still being able to paste the command into a terminal for quick tests; see [here](http://stackoverflow.com/a/33271539/45375) for an example. – mklement0 Oct 22 '15 at 21:44
0

awk lets you specify a regular expression, such as a character set, as the field separator. Therefore, if you know that your names are always followed by numbers, you can use:

awk -F"[0-9]" '{print $1}' /tmp/x
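To cover the registers as well, one possible extension keeps the regular-expression field separator for the name and strips a copy of the line down to its final digits. This is a sketch under assumptions: the function name `name_and_register` and the tab-separated output are illustrative choices, and the separator is written with a leading space so that the name carries no trailing blank.

```shell
# Hypothetical helper: prints "name<TAB>register" for each input line.
name_and_register() {
  awk -F' [0-9]' '{
    reg = $0
    sub(/^.*[^0-9]/, "", reg)   # drop everything up to the last non-digit
    print $1 "\t" reg           # $1 is the text before the first " <digit>"
  }'
}

# Usage on one of the sample lines:
printf '%s\n' 'John Fitzgerald Kennedy 3456' | name_and_register
```

With this separator `$NF` is not a reliable way to get the register, because the split consumes the first digit of every number; that is why the sketch works on a copy of the whole line instead.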
mklement0
Keith Hanlan
  • Nice, but you're only answering half the question (you're extracting the names only, not the numbers). `-F"␣[0-9]"` (by `␣` I mean an actual space char.) would eliminate the trailing spaces from the output. – mklement0 Oct 22 '15 at 22:02