Background
The background of my question: I want to find every occurrence of the unit mA, in any upper/lower-case spelling. Where it is misused as ma/Ma/MA, I want to show the user as many surrounding characters as possible, so that the user can search for it and locate it easily.
As we know, mA is a valid unit of electric current. To keep things simple, we only use integers, so every line in the following text
case 1, only number and unit: 1mA
case 2, number and unit, space: 1mA current
case 3, number and unit, punctuation: 1mA,
case 4, number and unit, Unicode characters: 1mA电流I
is a valid expression.
But
case 5, 1mAcurrent
should be an invalid expression, since no English letter may follow the unit without a space.
My regular expression attempts
So what is the correct regular expression in this situation? I have tried each of the following patterns:
case 5 is accepted as valid, which is wrong: \d{1,}mA
case 4 is ignored: \d{1,}mA\b
case 4 is ignored: \d{1,}mA[^a-zA-Z]*\b
As you can see, none of them is correct.
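To make the behaviour of the three attempts easy to reproduce, here is a small check that runs each pattern against the five sample strings from the Background section. The names samples, attempts and matched_cases are only illustrative helpers I made up for this post. I am assuming Python 3 with str patterns, where \b treats Unicode letters such as 电 as word characters.

```python
import re

# The five sample strings from the Background section, keyed by case number.
samples = {
    1: "1mA",
    2: "1mA current",
    3: "1mA,",
    4: "1mA电流I",
    5: "1mAcurrent",
}

# The three patterns tried so far.
attempts = [
    r"\d{1,}mA",
    r"\d{1,}mA\b",
    r"\d{1,}mA[^a-zA-Z]*\b",
]


def matched_cases(pattern):
    """Return the case numbers whose sample string contains a match."""
    return [n for n, s in samples.items() if re.search(pattern, s)]


for pattern in attempts:
    print(pattern, "->", matched_cases(pattern))
```

What I want is a pattern for which this reports cases 1 to 4 and leaves out case 5. The first attempt reports all five cases, and the other two report only cases 1 to 3.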
My complex code
This is the Python code I am using. You will see that I rely on an extra Python if check on top of the regular expression:
import re
text = '''
case 1, only number and unit: 1mA
case 2, number and unit, space: 2mA current
case 3, number and unit, punctuation: 3mA,
case 4, number and unit, Unicode characters: 4mA电流I
case 5, 5mAcurrent
'''
lst = text.split('\n')
lst = [i for i in lst if i]
pattern = r'(?P<QUANTITY>\d{1,}mA)(?P<TAIL>.{0,5})'
for line in lst:
    for match in re.finditer(pattern, line):
        # extra line: reject a match whose tail starts with an English letter
        if not re.match('[a-zA-Z]', match.group('TAIL')):
            print(match.group('QUANTITY'), ', ', match.group('TAIL'))
which outputs
1mA ,
2mA , curr
3mA , ,
4mA , 电流I
Obviously, the bad expression in case 5, 5mAcurrent, is not reported, which is what I expected.
Asking for help
Is there an easy way to implement this in a single regular expression pattern? Thanks.
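One direction I have sketched, without being sure it covers every situation, is to move the extra if check into the pattern as a negative lookahead, (?![a-zA-Z]), placed right after the unit. This assumes that "English letters" means only ASCII a-z and A-Z. The names PATTERN and find_quantities are hypothetical and used only for this sketch.

```python
import re

# Assumption: "English letters" means ASCII a-z / A-Z only.
# (?![a-zA-Z]) rejects a match when such a letter directly follows "mA".
PATTERN = re.compile(r'(?P<QUANTITY>\d+mA)(?![a-zA-Z])(?P<TAIL>.{0,5})')


def find_quantities(line):
    """Return (quantity, tail) pairs for every accepted match in one line."""
    return [(m.group('QUANTITY'), m.group('TAIL')) for m in PATTERN.finditer(line)]


text = '''
case 1, only number and unit: 1mA
case 2, number and unit, space: 2mA current
case 3, number and unit, punctuation: 3mA,
case 4, number and unit, Unicode characters: 4mA电流I
case 5, 5mAcurrent
'''

for line in text.splitlines():
    for quantity, tail in find_quantities(line):
        print(quantity, ', ', tail)
```

On the sample text this should report cases 1 to 4 and skip case 5, the same as the if-based version, but I would still like to know whether this is the right approach or whether there is a simpler one.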