I have some strings with this structure: <name> (<unit>). I would like to extract the name and the unit. I use a regex for this, and in most cases it works fine.
However, in some cases the <unit> consists of Greek characters, like Ω. In these cases, my code fails to extract the two desired parts.
Here is my code:

import re

def name_unit_split(text):
    name = re.split(r' \([A-Za-z]*\)', text)[0]
    unit = re.findall(r'\([A-Za-z]*\)', text)

    if unit != []:
        unit = unit[0][1:-1]
    else:
        unit = ''

    return name, unit

print(name_unit_split('distance (mm)'))

and I get:

('distance', 'mm')

But when I try with:

print(name_unit_split('resistance (Ω)'))

I get:

('resistance (Ω)', '')

I searched for other regex placeholders and tried these, without success:

name = re.split(' \([\p{Greek}]*\)', text)[0]
unit = re.findall('\([\p{Greek}]*\)', text)

How can I find Greek characters (one or more, grouped) in a string using regex?
Furthermore, is there a better way to perform this task using regex? That is, is there a way to extract both <name> and <unit> with a single regex and save them in name and unit?

Zephyr

2 Answers

Like the Latin letters, the Greek letters occupy contiguous ranges of Unicode code points, so you can use \([α-ωΑ-Ω]*\) instead of \([A-Za-z]*\) to construct your regex.
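
As a minimal sketch of that substitution, the snippet below widens the character class from the question so it accepts both Latin and Greek letters. The name UNIT_PATTERN is hypothetical. One assumption worth checking against your own data: the ohm sign (U+2126) and the micro sign (U+00B5) look like Ω and μ but lie outside the Greek ranges, so they are listed explicitly here.

```python
import re

# Hypothetical constant name. The class covers Latin letters, the Greek
# ranges from the answer, plus OHM SIGN (U+2126) and MICRO SIGN (U+00B5),
# which look Greek but are separate code points (an assumption about the data).
UNIT_PATTERN = re.compile(r"\(([A-Za-zα-ωΑ-Ω\u2126\u00b5]*)\)")

for text in ("distance (mm)", "resistance (Ω)", "capacitance (μF)"):
    # findall returns the captured unit without the parentheses
    print(text, "->", UNIT_PATTERN.findall(text))
```

The ranges do not include accented Greek letters such as ά, which is unlikely to matter for unit symbols.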

I would personally prefer to use a regex like "[A-Za-z]* \([α-ωΑ-Ω]*\)" to check that the pattern holds and then use string functions to do the splitting, but that is a matter of personal preference.
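
One way to combine the check and the extraction, which also speaks to the second part of the question, is a single pattern with named capture groups. This is only a sketch: the group names, the PATTERN constant, and the fallback for strings without a unit are assumptions modelled on the question's function, and the unit class additionally allows Latin letters, the ohm sign (U+2126), and the micro sign (U+00B5).

```python
import re

# Assumed structure: "<name> (<unit>)" with nothing after the closing parenthesis.
PATTERN = re.compile(
    r"(?P<name>.+?) \((?P<unit>[A-Za-zα-ωΑ-Ω\u2126\u00b5]*)\)"
)

def name_unit_split(text):
    """Return (name, unit); unit is '' when text has no '(<unit>)' suffix."""
    match = PATTERN.fullmatch(text)
    if match is None:
        # Same fallback as the question's code: whole text as name, empty unit
        return text, ""
    return match.group("name"), match.group("unit")

print(name_unit_split("distance (mm)"))
print(name_unit_split("resistance (Ω)"))
```

Using fullmatch means a string with trailing text after the parenthesis is treated as having no unit, which may or may not be what you want.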

whilrun

A no-regex solution for the structure <name> (<unit>) is str.partition:

>>> name, _, unit = "resistance (Ω)"[:-1].partition(" (")
>>> name
'resistance'
>>> unit
'Ω'
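
The [:-1] slice above assumes every string ends with ")". A hedged sketch of a small wrapper that falls back to an empty unit when it does not; the function name is borrowed from the question and the fallback behaviour is an assumption:

```python
def name_unit_split(text):
    """Split '<name> (<unit>)' without regex; unit is '' if the shape differs."""
    name, sep, rest = text.partition(" (")
    if not sep or not rest.endswith(")"):
        # No " (" found, or no closing parenthesis at the end
        return text, ""
    return name, rest[:-1]  # drop the trailing ")"

print(name_unit_split("resistance (Ω)"))
print(name_unit_split("temperature"))
```

Because no character class is involved, this works for any unit symbol, Greek or otherwise.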
Mustafa Aydın