-1

I have a string: 5kg. I need to separate the numeric part from the textual part, so in this case it should produce two parts: 5 and kg.

For that I wrote this code:

grocery_uom = '5kg'
unit_weight, uom = grocery_uom.split('[a-zA-Z]+', 1)

print(unit_weight)

Getting this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-23a4dd3345a6> in <module>()
      1 grocery_uom = '5kg'
----> 2 unit_weight, uom = grocery_uom.split('[a-zA-Z]+', 1)
      3 #print(unit_weight)
      4 
      5 

ValueError: not enough values to unpack (expected 2, got 1)
    print(uom)

Edit: I wrote this:

import re

unit_weight, uom = re.split('[a-zA-Z]+', grocery_uom, 1)
print(unit_weight)
print('-----')
print(uom)

Now I am getting this output:

5 
-----

How do I store the second part of the string in a variable?

Edit1: I wrote this which solved my purpose (Thanks to Peter Wood):

unit_weight = re.split('([a-zA-Z]+)', grocery_uom, 1)[0]
uom = re.split('([a-zA-Z]+)', grocery_uom, 1)[1]
Debbie
  • 4
    Splitting using [**`str.split`**](https://docs.python.org/2/library/stdtypes.html#str.split) doesn't use regular expressions, you just specify delimiters. You probably want [**`re.split`**](https://docs.python.org/3/library/re.html#re.split). e.g. `re.split('[a-zA-Z]+', grocery_uom, 1)` – Peter Wood Mar 30 '19 at 14:55
  • @PeterWood : check edit plz. – Debbie Mar 30 '19 at 15:04
  • 1
    If you want to include the matched splitting group you need to use `'([a-zA-Z]+)'`, however, this will produce `['5', 'kg', '']`, I'm not yet sure why. There are probably better ways than splitting, e.g. just matching the pattern instead. – Peter Wood Mar 30 '19 at 15:10
  • Thanks. I wrote this which solved my purpose: unit_weight = re.split('([a-zA-Z]+)', grocery_uom, 1)[0] uom = re.split('([a-zA-Z]+)', grocery_uom, 1)[1] – Debbie Mar 30 '19 at 15:18
  • @Debbie check answer below to include cases with numbers with commas (>=1,000) – SanV Mar 30 '19 at 15:33
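
The three behaviours discussed in the question and comments above can be reproduced side by side. This is only a small sketch; the variable names `literal`, `no_group`, and `with_group` are made up for illustration.

```python
import re

grocery_uom = '5kg'

# str.split treats its argument as a literal delimiter, not a pattern.
# The text '[a-zA-Z]+' never occurs in '5kg', so nothing is split and
# unpacking the result into two names raises ValueError.
literal = grocery_uom.split('[a-zA-Z]+', 1)
print(literal)       # ['5kg']

# re.split discards whatever the separator pattern matched ('kg').
# Nothing follows 'kg', so the second element is an empty string.
no_group = re.split('[a-zA-Z]+', grocery_uom, maxsplit=1)
print(no_group)      # ['5', '']

# With a capturing group, the matched separator is kept in the result.
# The trailing '' is still the (empty) text after the separator.
with_group = re.split('([a-zA-Z]+)', grocery_uom, maxsplit=1)
print(with_group)    # ['5', 'kg', '']
```

This may also explain the trailing empty string mentioned in the comments: `re.split` always returns the text after the last separator, even when that text is empty.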

3 Answers

5

You don't want to split on the "kg", because splitting treats it as a separator rather than as part of the actual data. Looking at the docs (https://docs.python.org/3/howto/regex.html), I see you can include the separators in the result, but the split pattern is still intended to be a separator.

Here's an example of just making a pattern for exactly what you want:

import re
pattern = re.compile(r'(?P<weight>[0-9]+)\W*(?P<measure>[a-zA-Z]+)')

text = '5kg'
match = pattern.search(text)
print (match.groups())
weight, measure = match.groups()
print (weight, measure)
print ('the weight is', match.group('weight'))
print ('the unit is', match.group('measure'))
print (match.groupdict())

Output:

('5', 'kg')
5 kg
the weight is 5
the unit is kg
{'weight': '5', 'measure': 'kg'}

Kenny Ostrom
  • You aren't actually taking advantage of the `(?P...)` groups; you're just treating the result as a normal tuple. `re.compile(r'([0-9]+)([a-zA-Z]+)')` will suffice in that case. – chepner Mar 30 '19 at 15:17
  • Since this is a pattern on the data, I was able to account for optional whitespace by just adding it to the pattern. Try it with " 5 kg " – Kenny Ostrom Mar 30 '19 at 15:38
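
As a rough sketch of the matching approach with optional whitespace, the same pattern can be wrapped in a small helper. The name `split_quantity` and the choice to return `None` for input that does not match are assumptions made for illustration, not part of the answer above.

```python
import re

# Same pattern as in the answer: digits, optional non-word characters
# (such as spaces), then letters.
PATTERN = re.compile(r'(?P<weight>[0-9]+)\W*(?P<measure>[a-zA-Z]+)')


def split_quantity(text):
    """Return (weight, measure) as strings, or None if text does not match."""
    match = PATTERN.search(text)
    if match is None:
        # search() returns None when there is no match; calling
        # .groups() on it would raise AttributeError.
        return None
    return match.group('weight'), match.group('measure')


print(split_quantity(' 5 kg '))   # ('5', 'kg')
print(split_quantity('kg'))       # None
```

Checking for `None` before reading the groups keeps input without a number from raising an exception.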
1

*updated to allow for bigger numbers, such as "1,000"
Try this.

import re
grocery_uom = '5kg'
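# Inside a character class '?' is a literal character, so it is left out here:
# [0-9,] matches digits and commas only.
split_str = re.split(r'([0-9,]+)([a-zA-Z]+)', grocery_uom, maxsplit=1)
unit_weight, uom = split_str[1:3]
print(unit_weight, uom)

## Output:  5 kg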
SanV
  • This matches nonsense input as well, though, like `,10kg`. You'll want to verify that `unit_weight` is actually a valid number afterwards. – chepner Mar 30 '19 at 15:50
  • @chepner agreed but otherwise you could miss out on numbers > 999 with commas. somehow will need to add "do not match leading comma"? or is there a better standard protocol for such cases?? – SanV Mar 30 '19 at 16:03
  • You're assuming the input *uses* commas. If it does, then of course the regular expression has to accommodate them. But you'll want to verify that the commas are used *correctly*, with something like `r'((?:[1-9]\d{0,2})(?:,\d{3})*)'`. – chepner Mar 30 '19 at 16:09
  • (You're also assuming the `,` is the thousand's separator, which isn't a universal convention, but dealing with that is beyond the scope of this question.) – chepner Mar 30 '19 at 16:10
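
The validation suggested in these comments could be sketched roughly as follows. The helper name `parse_quantity`, the use of `fullmatch`, and the conversion to `int` are assumptions for illustration. Note that this pattern, as written, only accepts numbers of up to three digits or numbers with correctly placed commas, so an ungrouped value such as `1000kg` is rejected.

```python
import re

# One to three leading digits (no leading zero), then zero or more
# groups of a comma followed by exactly three digits, optional
# whitespace, and finally the unit letters.
QUANTITY = re.compile(r'([1-9]\d{0,2}(?:,\d{3})*)\s*([a-zA-Z]+)')


def parse_quantity(text):
    """Return (number, unit) or None if the commas are misplaced."""
    match = QUANTITY.fullmatch(text.strip())
    if match is None:
        return None
    number, unit = match.groups()
    # Drop the thousands separators before converting to an integer.
    return int(number.replace(',', '')), unit


print(parse_quantity('1,000kg'))   # (1000, 'kg')
print(parse_quantity('5kg'))       # (5, 'kg')
print(parse_quantity(',10kg'))     # None
```

Using `fullmatch` rather than `search` means the whole string has to fit the pattern, which is what rejects input like `,10kg`.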
1

You need to use a regex split rather than a simple string split. The precise pattern to split on is:

(?<=\d)(?=[a-zA-Z]+)

This matches the position that is preceded by a digit, hence the lookbehind (?<=\d), and followed by letters, hence the lookahead (?=[a-zA-Z]+). It can be seen in this demo, where the split point is shown with a pink marker.


Here is your modified Python code:

import re

grocery_uom = '5kg'
unit_weight, uom = re.split(r'(?<=\d)(?=[a-zA-Z]+)', grocery_uom, 1)

print('unit_weight: ', unit_weight, 'uom: ', uom)

This prints:

unit_weight:  5 uom:  kg

If there may be optional whitespace between the number and the unit, use this regex instead, which also consumes the whitespace during the split:

(?<=\d)\s*(?=[a-zA-Z]+)

Demo allowing optional space
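
A minimal sketch of the whitespace-tolerant split, assuming Python 3.7 or later (earlier versions of `re.split` do not split on a match that is empty, which is what this pattern produces for `'5kg'`). The names `boundary` and `results` are made up for illustration.

```python
import re

# Position after a digit, any amount of whitespace, then a letter ahead.
# The lookarounds consume nothing, so only the whitespace is removed.
boundary = re.compile(r'(?<=\d)\s*(?=[a-zA-Z])')

results = []
for text in ('5kg', '5 kg'):
    # Assumes each input has the form <digits><optional space><letters>;
    # otherwise the unpacking raises ValueError.
    unit_weight, uom = boundary.split(text, maxsplit=1)
    results.append((unit_weight, uom))

print(results)   # [('5', 'kg'), ('5', 'kg')]
```

Input without a digit-to-letter boundary, such as `'kg'`, comes back as a single element, so the two-name unpacking would fail in the same way as in the question.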

Pushpesh Kumar Rajwanshi