0

Sample Text:

"UNCKEV\nPumpkins 10/1/20-2030\nRunners\nha\nH[ 12 ]\nA[ O ]\nKNOWLEDGI\nPLA\nDISTRIBUTION\nHOME TEAM\nPINK VISITING TEAM\nBLANCHE BUREAU NATIONAL\nJAUNE \u00c9C\nALE\nPR\u00c9CISER LES DE\nSEULEMENT\nOFF\nSORTIE\nSTART\nD\u00c9BUT\nON\nRETOUR\nPER\nP\u00c9R.\nMIN\nSERV\nPURG\nOFFENCE\nINFRACTION\nDUR\u00c9E\nNo.\nDU\nNeinterferCACE =\n188 Cross clicak 3\n1010hgh shicle\n"

I'm trying to extract H[(wildcard)] and A[(wildcard)] from the sample text, separately.

If I use x = re.search('H\[.*\]', ocr[0]), it matches the whole string H[ 12 ]\nA[ O ]

If I use 'A\[.*\]', it finds A[ O ] by itself, but I can't seem to find just H[ 12 ].

StabCode
  • 106
  • 1
  • 6
stygarfield
  • 107
  • 9

3 Answers

0

This has to do with greedy quantifiers in Python's regular expression library: https://docs.python.org/3/library/re.html (search the page for "greedy").

The greedy quantifier * matches as many characters as possible. To make it non-greedy, append a ? to it. The corrected regex is then: H\[.*?\]

To make this search work for any uppercase letter, try: [A-Z]\[.*?\]
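As a minimal sketch of the difference, the snippet below compares the greedy and non-greedy patterns. It assumes the OCR text contains the literal two-character sequence backslash + n rather than real line breaks, which would explain the match reported in the question; the `text` value is a shortened, hypothetical excerpt of the sample.

```python
import re

# Assumption: the OCR text holds backslash + "n" as two literal characters,
# not real line breaks. Shortened, hypothetical excerpt of the sample text.
text = r"ha\nH[ 12 ]\nA[ O ]\nKNOWLEDGI"

# Greedy: .* runs to the end of the text, then backtracks to the LAST "]".
greedy = re.search(r"H\[.*\]", text).group()

# Non-greedy: .*? stops at the FIRST "]" that lets the pattern succeed.
lazy = re.search(r"H\[.*?\]", text).group()

# Any uppercase letter followed by a bracketed value.
both = re.findall(r"[A-Z]\[.*?\]", text)

print(greedy)  # H[ 12 ]\nA[ O ]
print(lazy)    # H[ 12 ]
print(both)    # ['H[ 12 ]', 'A[ O ]']
```

If the text contains real line breaks instead, note that . does not match a newline unless re.DOTALL is passed, so even the greedy pattern would stop at the end of the H[ ... ] line.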

Hope this helps!

iiKop47
  • 156
  • 6
-1

Use a non-greedy pattern:

\b[AH]\[.*?\]

Python script:

import re

inp = "UNCKEV\nPumpkins 10/1/20-2030\nRunners\nha\nH[ 12 ]\nA[ O ]\nKNOWLEDGI\nPLA\nDISTRIBUTION\nHOME TEAM\nPINK VISITING TEAM\nBLANCHE BUREAU NATIONAL\nJAUNE \u00c9C\nALE\nPR\u00c9CISER LES DE\nSEULEMENT\nOFF\nSORTIE\nSTART\nD\u00c9BUT\nON\nRETOUR\nPER\nP\u00c9R.\nMIN\nSERV\nPURG\nOFFENCE\nINFRACTION\nDUR\u00c9E\nNo.\nDU\nNeinterferCACE =\n188 Cross clicak 3\n1010hgh shicle\n"
# \b anchors at a word boundary; .*? stops at the first closing bracket
matches = re.findall(r'\b[AH]\[.*?\]', inp)
print(matches)

This prints:

['H[ 12 ]', 'A[ O ]']
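If only the values inside the brackets are needed, one possible variation is to add capture groups to the same pattern. This is a sketch rather than part of the original script: the `pairs` and `values` names are hypothetical, and `inp` here is a shortened excerpt of the sample text.

```python
import re

# Shortened, hypothetical excerpt of the sample text (real line breaks here).
inp = "Runners\nha\nH[ 12 ]\nA[ O ]\nKNOWLEDGI\n"

# Group 1 is the letter; group 2 is the bracket contents without the
# padding spaces, because \s* consumes the whitespace on either side.
pairs = re.findall(r"\b([AH])\[\s*(.*?)\s*\]", inp)

# With more than one group, findall returns tuples, which dict() accepts.
values = dict(pairs)

print(pairs)   # [('H', '12'), ('A', 'O')]
print(values)  # {'H': '12', 'A': 'O'}
```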
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • Both regexes do not match. I do not understand why you posted two different ones without any explanation. Additionally, none of them capture the required values. https://regex101.com/r/gcZZzW/1 – Pan Jan 14 '20 at 00:35
-1

Try this:

H\[ (\w+) \](?:.|\n)+A\[ (\w+) \]

If you know that the H and A entries are always separated by a single newline and nothing else, replace (?:.|\n)+ with just \n.

I'm not sure what the H and A values can contain, but \w+ should capture most of them. The two values end up in capture groups 1 and 2.
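A short sketch of how both variants of this pattern could be used with re.search follows. The `text` value is a shortened, hypothetical excerpt of the sample with real line breaks, and the variable names are illustrative only.

```python
import re

# Shortened, hypothetical excerpt of the sample text (real line breaks).
text = "ha\nH[ 12 ]\nA[ O ]\nKNOWLEDGI\n"

# General form: (?:.|\n)+ skips over anything, newlines included,
# between the H entry and the A entry.
general = re.search(r"H\[ (\w+) \](?:.|\n)+A\[ (\w+) \]", text)

# Stricter form: the A entry must sit on the line right after the H entry.
adjacent = re.search(r"H\[ (\w+) \]\nA\[ (\w+) \]", text)

h_value, a_value = general.groups()
print(h_value, a_value)   # 12 O
print(adjacent.groups())  # ('12', 'O')
```

Both patterns expect exactly one space on each side of the value, and \w+ will not match an empty value or one containing spaces or punctuation, so re.search returns None in those cases.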

Pan
  • 331
  • 1
  • 7