Extracting name and number from string

Question

Similar to this question I have a string of names and numbers separated by a colon:

s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'

I'm trying to split this to get:

[('Waz D', '5'),
 ('l gu l', '5'),
 ('GrinVe', '3'),
 ('P LUK', '2'),
 ('Cubbi', '1'),
 ('2 nd dok', '1'),
 ('maf 74', '1'),
 ('abr12', '1')]

I have tried two regular expressions so far with mixed success:

re.findall(r"(.*?)[a-zA-Z0-9]+: (\d+)*", s)
[('Waz ', '5'),
 (' l gu ', '5'),
 (' ', '3'),
 (' P ', '2'),
 (' ', '1'),
 (' 2 nd ', '1'),
 (' maf ', '1'),
 (' ', '1')]

And:

re.findall(r"(.*?)([a-zA-Z0-9]+): (\d+)*", s)
[('Waz ', 'D', '5'),
 (' l gu ', 'l', '5'),
 (' ', 'GrinVe', '3'),
 (' P ', 'LUK', '2'),
 (' ', 'Cubbi', '1'),
 (' 2 nd ', 'dok', '1'),
 (' maf ', '74', '1'),
 (' ', 'abr12', '1')]

How can I adjust this to get the output I'm after?

Try [`re.findall(r'\s*(.*?[^\W_]):\s*(\d+)', s)`](https://ideone.com/XTNZJl) — Wiktor Stribiżew, Jul 04 '18 at 09:43

score 1 · Answer 1 · answered Jul 04 '18 at 09:45

Consume the whitespace greedily and don't put it into the matching groups.

>>> import re
>>> s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'
>>> 
>>> re.findall(r'([^:]+?):\s*(\d+)\s*', s)
[('Waz D', '5'), ('l gu l', '5'), ('GrinVe', '3'), ('P LUK', '2'), ('Cubbi', '1'), ('2 nd dok', '1'), ('maf 74', '1'), ('abr12', '1')]
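To see what keeping the whitespace out of the groups buys you, here is a small sketch (a comparison added for illustration, not part of the answer above) that runs the pattern with and without the trailing \s*. Without it, each later match starts on the leftover space, so the names pick up a leading blank.

```python
import re

s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'

# Pattern from the answer: the trailing \s* eats the separator after each number.
clean = re.findall(r'([^:]+?):\s*(\d+)\s*', s)

# Same pattern minus the trailing \s* (for comparison only): the separator
# is left behind and ends up at the front of the next name.
leaky = re.findall(r'([^:]+?):\s*(\d+)', s)

print(clean[1])  # ('l gu l', '5')
print(leaky[1])  # (' l gu l', '5')
```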

score 1 · Answer 2 · answered Jul 04 '18 at 09:52

If we assume that each name is always followed by a colon-space-number-space sequence, you can do it like this:

re.findall(r"(.+?):\s(\d+)\s", s)

[('Waz D', '5'),
 ('l gu l', '5'),
 ('GrinVe', '3'),
 ('P LUK', '2'),
 ('Cubbi', '1'),
 ('2 nd dok', '1'),
 ('maf 74', '1'),
 ('abr12', '1')]
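One caveat with requiring a trailing \s: a pair that sits at the very end of the string has no whitespace after its number and is silently dropped. A minimal sketch, using a shortened made-up input and a variant pattern (the `(?:\s|$)` ending is an assumption, not from the answer) that also accepts the end of the string:

```python
import re

# Hypothetical short input whose last pair ends the string.
t = 'Waz D: 5 abr12: 1'

# Pattern from the answer: needs a whitespace character after the number.
strict = re.findall(r"(.+?):\s(\d+)\s", t)

# Variant (an assumption, not from the answer): whitespace OR end of string.
relaxed = re.findall(r"(.+?):\s(\d+)(?:\s|$)", t)

print(strict)   # [('Waz D', '5')]
print(relaxed)  # [('Waz D', '5'), ('abr12', '1')]
```

On the string from the question this makes no difference, because the trailing "Waza D 5" has no colon and is skipped by both patterns.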

score 1 · Accepted Answer · answered Jul 04 '18 at 09:53

It comes down to splitting on the combination : \d, nothing else (apart from suppressing leading and trailing whitespace here and there). All it needs is a group of any length that does not contain a colon :, followed by that colon and then a single run of digits.

import re
s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'

print(re.findall(r'([^:]+):\s*(\d+)\s+', s))

result:

[('Waz D', '5'),
 ('l gu l', '5'),
 ('GrinVe', '3'),
 ('P LUK', '2'),
 ('Cubbi', '1'),
 ('2 nd dok', '1'),
 ('maf 74', '1'),
 ('abr12', '1')]

answered Jul 04 '18 at 09:53

Jongware

I tried this on a string containing "kuti2)(G): 2" as well and it worked. Is that a fortunate coincidence or will this always work for parentheses? – Chuck Jul 04 '18 at 10:10
@Chuck: no, it's not a coincidence and yes, it will work for any string. The contents of the first match `[^:]+` do not matter *at all*. The only exception is the colon itself. – Jongware Jul 04 '18 at 11:13
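
A quick way to check this, as a sketch: run the accepted pattern on a made-up string that contains the "kuti2)(G): 2" entry from the comment. The second entry and the trailing space are additions for illustration, so that the \s+ at the end of the pattern has something to match.

```python
import re

# Made-up test string; only 'kuti2)(G): 2' comes from the comment above.
t = 'kuti2)(G): 2 abr12: 1 '

# [^:]+ accepts anything except a colon, so parentheses need no special handling.
result = re.findall(r'([^:]+):\s*(\d+)\s+', t)
print(result)  # [('kuti2)(G)', '2'), ('abr12', '1')]
```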

score 1 · Answer 4 · answered Jul 04 '18 at 09:56

You could match zero or more whitespace characters (\s*), then capture one or more characters that are not a colon in a group using a negated character class: ([^:]+).

Then match a colon, zero or more whitespace characters (\s*), and capture one or more digits in a second group: (\d+).

\s*([^:]+):\s*(\d+)
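
For reference, a minimal sketch of that pattern applied to the string from the question with re.findall (the variable names are illustrative):

```python
import re

s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'

# The leading \s* sits outside the first group, so the space that
# separates two pairs is consumed without ending up in the name.
pairs = re.findall(r'\s*([^:]+):\s*(\d+)', s)

for name, number in pairs:
    print(repr(name), number)
```

The trailing "Waza D 5" is not returned, since it contains no colon.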

Valdi_Bo · Answer 5 · 2018-07-04T11:23:01.390

In your sample the name generally starts with a letter, but in one case it starts with a digit.

So the first capturing group, for the name, should:

start with [a-z\d] (remember the re.I flag at the end),
then contain [^:]* - a sequence of characters other than :.

Your solution ([a-zA-Z0-9]+) is wrong, because the name can contain spaces.

The second group, matching the number is simple - just \d+.

Between these two groups there should be :\s* - a colon followed by optional whitespace.

The code contains a single call to re.findall, as follows:

re.findall(r"([a-z\d][^:]*):\s*(\d+)", s, flags=re.I)
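
As a sketch, here is that call on the sample string. Because the first group must start with a letter or digit, the space left between two pairs is skipped rather than captured:

```python
import re

s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'

# re.I makes [a-z] match uppercase letters as well.
pairs = re.findall(r"([a-z\d][^:]*):\s*(\d+)", s, flags=re.I)

print(len(pairs))  # 8
print(pairs[5])    # ('2 nd dok', '1')
```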

But I have doubts about Cubbi: 1 2 in your sample. Should the 2 really be part of the next name?

If not, consider changing the regex to: ([a-z][^:]*):\s*(\d+(?: \d+)?). Differences:

The name must start with a letter (not a digit),
The number can contain a "second part", preceded by a single space - (?: \d+)?.

Then 1 2 will be the "number" for Cubbi and the next name will start from "nd".
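
A minimal sketch of that alternative, assuming the re.I flag is still passed (the regex above is written with a lowercase-only class):

```python
import re

s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D 5'

# (?: \d+)? optionally extends the number by one space plus more digits.
pairs = re.findall(r"([a-z][^:]*):\s*(\d+(?: \d+)?)", s, flags=re.I)

print(pairs[4])  # ('Cubbi', '1 2')
print(pairs[5])  # ('nd dok', '1')
```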

And what about Waza D 5 at the end of your sample? Did you forget to put the colon before 5?

PythonProgrammi · Answer 6 · 2018-07-08T17:34:34.093

My solution

I've added a ':' after Waza D because I think its absence was a typo (the rule should be name: number). The pattern, as I see it, is a name that starts with a word character and continues with more word characters and spaces up to the ':', followed by a space and a number.

s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D: 5'

import re

# \w        a word character (letter, digit or underscore) starts the name
# [\w\s]+   followed by one or more word characters or whitespace
# :         followed by a colon
# \s[0-9]   then one whitespace character and a single digit
x = re.findall(r"\w[\w\s]+:\s[0-9]", s)
print(*x, sep="\n")

output

Waz D: 5
l gu l: 5
GrinVe: 3
P LUK: 2
Cubbi: 1
2 nd dok: 1
maf 74: 1
abr12: 1
Waza D: 5
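
If (name, number) tuples are wanted rather than whole-match strings, the same pattern can be given capture groups. This is a sketch of a variation, not part of the solution above: it adds two groups and swaps [0-9] for \d+ so that multi-digit numbers are captured whole.

```python
import re

# Sample string with the ':' added after 'Waza D', as above.
s = 'Waz D: 5 l gu l: 5 GrinVe: 3 P LUK: 2 Cubbi: 1 2 nd dok: 1 maf 74: 1 abr12: 1 Waza D: 5'

# Group 1 is the name, group 2 the number (variation for illustration).
pairs = re.findall(r"(\w[\w\s]+):\s(\d+)", s)

print(len(pairs))  # 9
print(pairs[-1])   # ('Waza D', '5')
```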
