Separating words from numbers in Python

Question

I have to search a string for words that have a number as a prefix or suffix (for example, "abc21" or "943xyz"). Then I need to split the number from the word.

For example, "abc12" has to be converted to "abc 12", and "12abc" has to be converted to "12 abc".

However, if the number lies in between letters, for example, "a12bc", then it should be left as it is. How can we do this? Is there a simpler way than regex?

Please show what you have tried already to solve this problem. — roganjosh, Jan 25 '18 at 21:17
this is pretty close : https://stackoverflow.com/questions/430079/how-to-split-strings-into-text-and-number — jmunsch, Jan 25 '18 at 21:18
@jmunsch pretty close + too broad = closing to me :) thanks for the link — Jean-François Fabre, Jan 25 '18 at 21:19
That's a pretty far-fetched duplicate, so I'm gonna leave a hint for the OP: The regex in that question `\D+\d+` matches only words with digits at the end. Duplicate that and turn it around, you get `\D+\d+|\d+\D+` which matches words with digits on either end. From there you just need to figure out how to insert a space. (Hint #2: `re.sub`) — Aran-Fey, Jan 25 '18 at 21:23
@Rawing actually, I think `\w` matches alphanumeric, so that might not play well, perhaps best to use `\d\D` — juanpa.arrivillaga, Jan 25 '18 at 21:28
@Rawing a somewhat hacky approach: `re.sub(r'((\D+)(\d+))|((\d+)(\D+))', r"\2 \3\5 \6", '943xyz').strip()` I'm not sure if I grasp grouping correctly. — juanpa.arrivillaga, Jan 25 '18 at 21:30
@juanpa.arrivillaga You don't need that many groups. `re.sub(r'(\D+)(\d+)|(\d+)(\D+)', r"\1\3 \2\4", '943xyz')` works too :) — Aran-Fey, Jan 25 '18 at 21:33
@Rawing yep, was definitely over-doing it. Wasn't sure about the precedence of alternation in regex... — juanpa.arrivillaga, Jan 25 '18 at 21:34
hey also welcome to stackoverflow check these out when you get time : https://stackoverflow.com/tour AND https://stackoverflow.com/help/how-to-ask AND https://meta.stackexchange.com/questions/21788/how-does-editing-work — jmunsch, Jan 25 '18 at 21:35

score 0 · Answer 1 · answered Jan 25 '18 at 21:54

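Something simple like one of these should work.
All that's needed is to protect the boundaries with the lookarounds (?<![\da-z]) ... (?![\da-z]),
which do two things:
- They stop the engine from starting or ending a match between like kinds (digit next to digit, or letter next to letter).
- They ensure there are no bookends, that is, no extra run of letters or digits on either side of the match (as in "a12bc").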
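
To see what the two guards buy you, here is a small Python sketch that runs the same alternation with and without them. The names BARE, GUARDED and REPL are illustrative only, and Python's re.sub spells backreferences as \1 rather than $1.

```python
import re

# Core alternation only: letters+digits or digits+letters, no guards.
BARE = re.compile(r'(?:([a-z]+)(\d+)|(\d+)([a-z]+))')

# Same alternation wrapped in the lookbehind/lookahead guards.
GUARDED = re.compile(r'(?<![\da-z])(?:([a-z]+)(\d+)|(\d+)([a-z]+))(?![\da-z])')

REPL = r'\1\3 \2\4'  # Python spelling of "$1$3 $2$4"

for word in ("abc21", "943xyz", "a12bc"):
    print(word, '->', repr(BARE.sub(REPL, word)), 'vs', repr(GUARDED.sub(REPL, word)))
```

Without the guards, "a12bc" comes back as "a 12bc"; with them it is left alone.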

Way 1:

Find (?<![\da-z])(?:([a-z]+)(\d+)|(\d+)([a-z]+))(?![\da-z])
Replace $1$3 $2$4

https://regex101.com/r/k4gNoE/1

 (?<! [\da-z] )
 (?:
      ( [a-z]+ )             # (1)
      ( \d+ )                # (2)
   |  
      ( \d+ )                # (3)
      ( [a-z]+ )             # (4)
 )
 (?! [\da-z] )
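
In Python, the expanded layout above maps onto re.VERBOSE. The following is a sketch, not the answerer's own code: WAY1 and split_way1 are made-up names, re.IGNORECASE is an added assumption so that uppercase letters are handled too, and the replacement relies on unmatched groups expanding to an empty string (Python 3.5 or later).

```python
import re

WAY1 = re.compile(r'''
    (?<![\da-z])            # not preceded by a digit or letter
    (?:
        ([a-z]+)(\d+)       # groups 1 and 2: letters, then digits
      |
        (\d+)([a-z]+)       # groups 3 and 4: digits, then letters
    )
    (?![\da-z])             # not followed by a digit or letter
''', re.VERBOSE | re.IGNORECASE)

def split_way1(text):
    # Only one branch matches, so the other pair of groups expands to ''.
    return WAY1.sub(r'\1\3 \2\4', text)

print(split_way1("abc12 12abc a12bc"))  # abc 12 12 abc a12bc
```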

Way 2:

Find (?<![\da-z])(?:([a-z]+(?=\d)|\d+(?=[a-z]))((?<=\d)[a-z]+|(?<=[a-z])\d+))(?![\da-z])
Replace $1 $2

https://regex101.com/r/LbWnkg/1

 (?<! [\da-z] )
 (?:
      (                        # (1 start)
           [a-z]+ 
           (?= \d )
        |  \d+ 
           (?= [a-z] )
      )                        # (1 end)
      (                        # (2 start)
           (?<= \d )
           [a-z]+ 
        |  (?<= [a-z] )
           \d+ 
      )                        # (2 end)
 )
 (?! [\da-z] )
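
A possible Python rendering of Way 2, again only a sketch with hypothetical names (WAY2, split_way2). The pattern is copied from the Find line and split across string literals for readability; it is lowercase-only as written.

```python
import re

WAY2 = re.compile(
    r'(?<![\da-z])'
    r'(?:([a-z]+(?=\d)|\d+(?=[a-z]))'   # group 1: first run, followed by the other kind
    r'((?<=\d)[a-z]+|(?<=[a-z])\d+))'   # group 2: second run, preceded by the other kind
    r'(?![\da-z])'
)

def split_way2(text):
    # Both groups take part in every match, so a plain "\1 \2" is enough.
    return WAY2.sub(r'\1 \2', text)

for word in ("abc21", "943xyz", "a12bc"):
    print(word, '->', split_way2(word))
```

Because groups 1 and 2 always participate in a match, this version does not depend on how unmatched groups are expanded in the replacement.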

score 0 · Answer 2 · answered Jan 25 '18 at 22:05

You can try this:

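import re

def split_vals(s):
    # Keep letters-digits-letters tokens whole; otherwise pick out the leading/trailing digits and the letter runs.
    return ' '.join(re.findall(r'^\d+|\d+$|^[a-zA-Z]+\d+[a-zA-Z]+$|^[a-zA-Z]+$|[a-zA-Z]+', s))

s = ["abc21", "943xyz", "12abc", "a12bc"]
new_s = list(map(split_vals, s))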

Output:

['abc 21', '943 xyz', '12 abc', 'a12bc']
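
The question also asks whether this can be done without regex. One way, sketched below with string methods only, is to measure the leading and trailing digit runs of each token. split_token and split_sentence are hypothetical helper names, and a "word" is assumed to be anything for which str.isalpha() is true.

```python
DIGITS = "0123456789"

def split_token(token):
    """Split a leading or trailing digit run from an otherwise alphabetic token."""
    head = len(token) - len(token.lstrip(DIGITS))  # length of leading digit run
    tail = len(token) - len(token.rstrip(DIGITS))  # length of trailing digit run
    if 0 < head < len(token) and token[head:].isalpha():
        return token[:head] + " " + token[head:]
    if 0 < tail < len(token) and token[:-tail].isalpha():
        return token[:-tail] + " " + token[-tail:]
    return token  # digits in the middle, pure digits, or pure letters

def split_sentence(text):
    # Note: str.split() collapses runs of whitespace.
    return " ".join(split_token(t) for t in text.split())

print([split_token(t) for t in ["abc21", "943xyz", "12abc", "a12bc"]])
```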

score 0 · Answer 3 · answered Jan 26 '18 at 09:12

You can use re.sub to insert that space:

re.sub(r'\b(?:(\D+)(\d+)|(\d+)(\D+))\b', r"\1\3 \2\4", word)

This matches digits followed by non-digits or vice-versa.

The \b boundaries make sure the word is matched in its entirety, so that we don't match numbers in the middle of a word.

The replacement pattern \1\3 \2\4 takes advantage of the fact that unmatched groups are replaced with the empty string (this is the behavior in Python 3.5 and later; earlier versions raise an error instead). We know that either groups 1 and 2 or groups 3 and 4 will match, and the other groups will be empty, so \1\3 \2\4 will always produce a valid result (without duplicating any part of the input).
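
To make the group behavior concrete, this short sketch prints which groups take part in each match. PATTERN is just a local name for the regex above.

```python
import re

PATTERN = re.compile(r'\b(?:(\D+)(\d+)|(\d+)(\D+))\b')

for word in ("abc12", "12abc"):
    match = PATTERN.search(word)
    # groups() reports None for groups that did not take part in the match
    print(word, match.groups())
# abc12 ('abc', '12', None, None)
# 12abc (None, None, '12', 'abc')
```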

Examples:

>>> re.sub(r'\b(?:(\D+)(\d+)|(\d+)(\D+))\b', r"\1\3 \2\4", "abc12")
'abc 12'
>>> re.sub(r'\b(?:(\D+)(\d+)|(\d+)(\D+))\b', r"\1\3 \2\4", "12abc")
'12 abc'
>>> re.sub(r'\b(?:(\D+)(\d+)|(\d+)(\D+))\b', r"\1\3 \2\4", "a12bc")
'a12bc'
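
One caveat when running this over a whole sentence instead of a single word: \D matches any non-digit, including spaces and punctuation, so a match can pull in surrounding whitespace. A possible variant, assuming that "word" means ASCII letters only, swaps \D+ for [a-zA-Z]+. The names LETTERS_DIGITS and split_numbers are made up for this sketch.

```python
import re

# Assumption: words consist of ASCII letters only.
LETTERS_DIGITS = re.compile(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b')

def split_numbers(text):
    return LETTERS_DIGITS.sub(r'\1\3 \2\4', text)

print(split_numbers("abc12 34 12abc a12bc"))  # abc 12 34 12 abc a12bc
```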

Thank you! I did almost the same thing, but I didn't put \b in. That's why I was getting the error. — Aastha Jairath, Jan 26 '18 at 20:16
