Background

The background of my question: I want to find every occurrence of the mA unit in any upper/lower-case combination, and show the user as many surrounding characters as possible wherever it is misused as ma/Ma/MA, so that the user can search for it and locate it easily.

As we know, mA is a valid unit of electrical current. To keep things simple, we only use integers, so every line in the following text

case 1, only number and unit: 1mA
case 2, number and unit, space: 1mA current
case 3, number and unit, punctuation: 1mA,
case 4, number and unit, Unicode characters: 1mA电流I   

is a valid expression.

But

case 5, 1mAcurrent

should be an invalid expression, since no English letter is allowed to follow the unit without a space.

My regular expression attempts

So what is the correct regular expression in this situation? I have tried each pattern in the following text:

case 5 is taken as a right one, this is wrong      \d{1,}mA
case 4 is ignored                                  \d{1,}mA\b
case 4 is ignored                                  \d{1,}mA[^a-zA-Z]*\b

As you can see, none of them is correct.
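The reason `\b` skips case 4 can be checked directly. In Python 3, `\w` (and therefore `\b`) is Unicode-aware for `str` patterns, so a Chinese character right after `mA` counts as a word character and no boundary is found. The sketch below compares the default behaviour with the `re.ASCII` flag, under which only ASCII letters, digits and the underscore are word characters. The `cases` list and the `which_match` helper are illustrative names, not part of my original code.

```python
import re

# Shortened versions of cases 1-5 from the text above.
cases = ['1mA', '2mA current', '3mA,', '4mA电流I', '5mAcurrent']

def which_match(pattern, flags=0):
    """Return the cases in which the pattern is found."""
    return [s for s in cases if re.search(pattern, s, flags)]

# Default (Unicode) word boundary: case 4 is missed.
unicode_hits = which_match(r'\d+mA\b')
# ASCII-only word boundary: case 4 is found, case 5 is still rejected.
ascii_hits = which_match(r'\d+mA\b', re.ASCII)

print(unicode_hits)
print(ascii_hits)
```

Note that with `re.ASCII` the `\b` test also rejects a digit or an underscore directly after `mA`, which may or may not be wanted.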

My complex code

This is the Python code I am using. As you will see, I rely on an extra Python if test on top of the regular expression:

import re
text = '''
case 1, only number and unit: 1mA
case 2, number and unit, space: 2mA current
case 3, number and unit, punctuation: 3mA,
case 4, number and unit, Unicode characters: 4mA电流I   
case 5, 5mAcurrent
'''
lst = text.split('\n')
lst = [i for i in lst if i]

pattern = r'(?P<QUANTITY>\d{1,}mA)(?P<TAIL>.{0,5})'

for text in lst:
    for match in re.finditer(pattern, text):    
        if not re.match('[a-zA-Z]', match.group('TAIL')): # extra line
            print(match.group('QUANTITY'), ', ', match.group('TAIL'))      

which outputs

1mA ,  
2mA ,   curr
3mA ,  ,
4mA ,  电流I  

As expected, the invalid expression in case 5, 5mAcurrent, is not reported.

Ask for help

Is there an easy way to implement this in a single regular expression pattern? Thanks.

oyster
  • Possible duplicate of https://stackoverflow.com/questions/16492933/regular-expression-to-match-boundary-between-different-unicode-scripts – tripleee Jun 12 '19 at 02:26

5 Answers

1

Use a negative lookahead just after the unit to check that no letter follows it:

pattern = r'(?P<QUANTITY>\d+mA)(?![a-z])(?P<TAIL>.{0,5})'
#                       here __^^^^^^^^^ 

Code:

pattern = r'(?P<QUANTITY>\d+mA)(?![a-z])(?P<TAIL>.{0,5})'

for text in lst:
    for match in re.finditer(pattern, text):    
        print(match.group('QUANTITY'), match.group('TAIL'))    
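For reference, here is a self-contained sketch of the same idea run against the five cases from the question. It makes one assumption beyond the pattern above: the lookahead is widened to `[A-Za-z]` so that an uppercase letter after the unit is rejected too. The `find_quantities` helper is a hypothetical name used only for this example.

```python
import re

text = '''
case 1, only number and unit: 1mA
case 2, number and unit, space: 2mA current
case 3, number and unit, punctuation: 3mA,
case 4, number and unit, Unicode characters: 4mA电流I
case 5, 5mAcurrent
'''

# Assumption: [A-Za-z] instead of [a-z], so "5mACurrent" is rejected as well.
pattern = re.compile(r'(?P<QUANTITY>\d+mA)(?![A-Za-z])(?P<TAIL>.{0,5})')

def find_quantities(block):
    """Collect (quantity, tail) pairs from every line of the block."""
    results = []
    for line in block.splitlines():
        for m in pattern.finditer(line):
            results.append((m.group('QUANTITY'), m.group('TAIL')))
    return results

found = find_quantities(text)
for quantity, tail in found:
    print(quantity, repr(tail))
```

Case 5 produces no match because the lookahead fails on the `c` of `current`, so no separate if test is needed.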
Toto
  • thanks, it solves my case 1-5. Today, I met case 6 "case 6, 6mA to 7mA". In all case 1-6, `number + unit` is first class, so I do hope that "6mA to " and "7mA" are matched. Is it possible? – oyster Jun 13 '19 at 14:00
  • The background of my question: to find all `mA` units in all upper/lower case. To show the user as many surrounding characters as possible where it is misused as `ma/Ma/MA`, so that the user can search and locate it easily. – oyster Jun 13 '19 at 14:25
  • @oyster: If you have only twice value+unit, you can do: `(?P<QUANTITY>\d+mA)(?![a-z])(?:.+?(?P<QUANTITY2>\d+mA)(?![a-z]))*(?P<TAIL>.{0,5})`. If you want to match case insensitive, add `(?i)` at the beginning of the regex: `(?i)(?P<QUANTITY>\d+mA)(?![a-z])(?:.+?(?P<QUANTITY2>\d+mA)(?![a-z]))*(?P<TAIL>.{0,5})` – Toto Jun 13 '19 at 14:53
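For case 6 from the comments, one possible alternative (a sketch, not taken from the answer or its comments) is to capture the tail inside a lookahead, so that the tail is reported but not consumed. `re.finditer` can then start the next match inside the previous tail, which lets both `6mA` and `7mA` be found however many quantities a line contains. The `(?i)` flag from the comment is kept so that misused spellings such as `Ma` are reported too. The `quantities_with_tail` helper is a hypothetical name.

```python
import re

# Assumption: TAIL sits inside a lookahead, so it is captured without being
# consumed and the next quantity may begin within the previous tail.
PATTERN = re.compile(r'(?i)(?P<QUANTITY>\d+mA)(?![a-z])(?=(?P<TAIL>.{0,5}))')

def quantities_with_tail(line):
    """Return (quantity, up to 5 following characters) for each match."""
    return [(m.group('QUANTITY'), m.group('TAIL')) for m in PATTERN.finditer(line)]

print(quantities_with_tail('case 6, 6mA to 7mA'))
print(quantities_with_tail('misused: 8Ma now'))
```

Because of `(?i)`, the class `[a-z]` in the lookahead also covers uppercase letters.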
0

You could try doing a regex search with the following pattern:

\d+mA(?= |current|电流I|,|$)

This would match e.g. 1mA followed by either a space, the word current, the Chinese term 电流I, comma, or the end of the input.

import re

text = "Here 1mA also 2mAcurrent and 3mA电流I and 4mA, and also 5mA"
matches = re.findall(r'\d+mA(?= |current|电流I|,|$)', text)
print(matches)

This prints:

['1mA', '2mA', '3mA', '4mA', '5mA']
Tim Biegeleisen
  • "current|电流I|," are all just samples. If I could enumerate all the cases like this, I think I would not need a regular expression. What about "5mA stable current", which follows the rule "there is a space behind the unit, so it is valid"? – oyster Jun 12 '19 at 02:29
  • @oyster But my pattern would already be picking up on `5mA stable current`. You only need to list the strings which might be directly adjoined to the end of `5mA`, with no space in between. – Tim Biegeleisen Jun 12 '19 at 02:32
  • We can't list which English/Chinese words are allowed because there are unlimited possibilities – oyster Jun 12 '19 at 02:53
0
pattern = r'(?P<value>\d+)(?P<units>mA)(\S+|)'
text = ['1mA','1mA电流I','1mA,','1mAcurrent']

for i,j in enumerate(text):
    match = re.match(pattern,j)
    if match:
        print("Text "+match[0]+" matches with value:"+match['value']+ 
        ' Units:'+match['units'])

The above code matches all cases and uses named groups so that each part can be retrieved by name. There are 3 groups; the first 2 are named (value and units).

You can expand the units to any other units of interest with pipe separation. \d+ for value matches any integer.
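As a rough sketch of the pipe-separated expansion, the snippet below builds the units alternation from a list. The unit list, the `parse` helper, and the added `(?![A-Za-z])` lookahead (used here so that `1mAcurrent` is rejected) are assumptions for illustration and are not part of the code above.

```python
import re

# Hypothetical list of units; longer names go first so that, for example,
# "mA" is tried before the bare "A".
UNITS = ['mA', 'uA', 'A', 'mV', 'V']
units_alt = '|'.join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# Assumption: a negative lookahead rejects an ASCII letter right after the unit.
pattern = re.compile(rf'(?P<value>\d+)(?P<units>{units_alt})(?![A-Za-z])')

def parse(s):
    """Return (value, units) if s starts with a valid quantity, else None."""
    m = pattern.match(s)
    return (m['value'], m['units']) if m else None

for s in ['1mA', '12V,', '3A电流', '5mV', '1mAcurrent']:
    print(s, '->', parse(s))
```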

tripleee
Dan Wisner
  • The 3rd grouping of \S+ matches non space characters (any length) this can match current in english or kanji, or voltage etc. It also hits on the comma. – Dan Wisner Jun 12 '19 at 02:30
  • That's equivalent to `\S*` which is to say no boundary is fine, too. I understand the OP wants to avoid 2mAcurrent. – tripleee Jun 12 '19 at 02:37
  • So we do not want to hit on the 4th item in the list. – Dan Wisner Jun 12 '19 at 02:39
  • Case 1~4 should be recognized as valid physical quantity (there is number and unit) expression; Case 5 should not be matched since it is not a valid physical quantity expression. However you code think case 5 is valid. – oyster Jun 12 '19 at 02:46
0

If I understand the problem correctly, we just want to capture the digits, followed by optional whitespace and then mA, which this simple expression might do:

([0-9]+)(\s+)?(?=mA)

I'm not sure about the technicalities, but if the numbers can be floats, ([0-9]+) would change to ([0-9.]+). At the end, we would append mA to each captured output.
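A minimal sketch of that capture-then-append idea in Python follows. It assumes a slightly stricter float form, `\d+(?:\.\d+)?`, in place of `[0-9.]+`, so that a string such as `1.2.3` is not taken as one number. The `extract_ma` name is hypothetical.

```python
import re

def extract_ma(text):
    """Capture each number that is followed by optional whitespace and 'mA',
    then append 'mA' to every captured number."""
    # Assumption: integers or simple decimals only (no signs, no exponents).
    numbers = re.findall(r'(\d+(?:\.\d+)?)\s*(?=mA)', text)
    return [n + 'mA' for n in numbers]

print(extract_ma('1.5 mA and 20mA, also 3 mA'))
```

This only looks at what precedes the unit, so it does not by itself reject a letter directly after `mA`, as in `5mAcurrent`.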

Demo

Emma
0
import re

pattern = r'(?P<value>\d+)(?P<units>mA)(\s[a-z]+|[\s,]|$)'
pattern2 = r'(?P<value>\d+)(?P<units>mA)([^a-z]\S+)'
text = ['1mA','5mA电流I','1mA,','1mAcurrent','1mA current']

for j in text:
    # First try the pattern for a space, a comma, or the end of the string.
    match = re.match(pattern,j)
    print(j)
    if match:
        print("Text "+match[0]+" matches with value:"+match['value']+' Units:'+match['units'])
    else:
        # Otherwise accept a non-lowercase follower, e.g. Chinese characters.
        match = re.match(pattern2,j)
        if match:
            print("Text "+match[0]+" matches with value:"+match['value']+' Units:'+match['units'])

This solution ignores case 5. It uses two patterns, with an else branch that tries the second pattern when the first one does not match.

Dan Wisner