
I am trying to clean some data that includes text like '6cm*8cm', '6cmx8cm', and '6*8'. I want to normalize these so that they all look the same. Note that the numbers vary, so the data may also contain '3cm*4cm', etc.

# input strings
strings = [
    "6cm*8cm",
    "12mmx15mm",
    'Device stemmer 2mm*8mm',
    'Device stemming 2mmx8mm'
]
# My desired output would be:
desired_strings = [
    '6*8',
    '12*15',
    'Device stemmer 2*8',
    'Device stemming 2*8'
]

I am using Python's 're' module. I would prefer to convert them to a simple '6*8' (i.e., number*number). Note that some entries contain strings like 'Device stemmer 2mm*8mm', and I do not want to change the other words.

Is there a Pythonic way to use regex to rewrite all possible combinations of paired numbers and units?

maaniB

1 Answer


I used:

import re

strings = [
    "6cm*8cm",
    "12mmx15mm",
    'Device stemmer 2mm*8mm',
    'Device stemming 2mmx8mm'
]

for s in strings:
    # Keep only the two numbers (groups 1 and 4) and join them with "*".
    result = re.sub(r"([0-9]+)(cm|mm)(\*|x)([0-9]+)(cm|mm)", r"\1*\4", s)
    print(result)

Notes:
([0-9]+): matches one or more digits,
(cm|mm): matches the unit; | is alternation (logical OR),
(\*|x): matches a literal * or x as the separator between the two values,
\1: refers to the first group (here the first number, e.g. 6),
\4: refers to the fourth group (here the second number, e.g. 8).

https://regex101.com/ and this answer helped.
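The question also lists a bare '6*8' and '6cmx8cm' among the inputs. One way to cover those with the same idea is to make the units optional. This is only a sketch: the helper name `normalize_dimensions` is made up for illustration, and it assumes the only units are cm and mm and the only separators are * and x.

```python
import re

# Assumption: units are limited to "cm" and "mm", separators to "*" and "x".
# (?:...)? makes each unit optional without adding a capturing group.
DIMENSION = re.compile(r"([0-9]+)(?:cm|mm)?[*x]([0-9]+)(?:cm|mm)?")


def normalize_dimensions(text):
    """Rewrite number/unit pairs such as '6cmx8cm' as 'number*number'."""
    return DIMENSION.sub(r"\1*\2", text)


for s in ["6cm*8cm", "6cmx8cm", "6*8", "3cm*4cm", "Device stemmer 2mm*8mm"]:
    print(normalize_dimensions(s))
```

Because the units are optional, something like '2x4' anywhere in the text is also rewritten to '2*4', which may or may not be wanted depending on the data.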

maaniB
You only need 2 capturing groups, as you don't need to keep what you are going to replace, and you could use the character classes `[cm]m` and `[*x]` instead of the alternations: `([0-9]+)[cm]m[*x]([0-9]+)[cm]m` https://regex101.com/r/olVPIG/1 – The fourth bird Aug 03 '20 at 17:00
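Applied to the sample strings from the question, the pattern suggested in this comment could be used roughly as follows. The names `PATTERN` and `cleaned` are just illustrative; with only two groups, the replacement refers to \2 instead of \4.

```python
import re

# Two capturing groups for the numbers; the units and the separator are
# matched with character classes and are not captured.
PATTERN = re.compile(r"([0-9]+)[cm]m[*x]([0-9]+)[cm]m")

strings = [
    "6cm*8cm",
    "12mmx15mm",
    "Device stemmer 2mm*8mm",
    "Device stemming 2mmx8mm",
]

cleaned = [PATTERN.sub(r"\1*\2", s) for s in strings]
print(cleaned)
```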