
I have the following text that I extracted from a PDF using UiPath Studio's OCR. It's the same block of text repeated 3 times due to it being the original, duplicate & triplicate of the same PDF page.

Os bens/serviços foram colocados à disposição do adquirente em 2020-04-16 * Data/Hora início de transporte: 2020-04-16 às 11:52

Total Líquido               500,00
Total de Descontos 500,00         
Desconto Documento                
Total de IVA                115,00
Total do Documento (EUR)    615,00

IVA      Incidência   Valor do IVA
Isento                            
6%                                
13%                               
23%      500,00       115,00      

b5El-Processado por programa certificado n.º75/AT.

Os bens/serviços foram colocados à disposição do adquirente em 2020-04-16 * Data/Hora início de transporte: 2020-04-16 às 11:52

Total Líquido               500,00
Total de Descontos 500,00         
Desconto Documento                
Total de IVA                115,00
Total do Documento (EUR)    615,00

IVA      Incidência   Valor do IVA
Isento                            
6%                                
13%                               
23%      500,00       115,00      

b5El-Processado por programa certificado n.º75/AT.

Os bens/serviços foram colocados à disposição do adquirente em 2020-04-16 * Data/Hora início de transporte: 2020-04-16 às 11:52

Total Líquido               500,00
Total de Descontos 500,00         
Desconto Documento                
Total de IVA                115,00
Total do Documento (EUR)    615,00

IVA      Incidência   Valor do IVA
Isento                            
6%                                
13%                               
23%      500,00       115,00      

b5El-Processado por programa certificado n.º75/AT.

I need to extract the 4-character code that comes right before "-Processado por programa", but I only want one match (the first one).

I already tried [^*]+(?=-Processado\spor\sprograma) and (.*?)(?=-Processado\spor\sprograma), but both give me 3 matches.

It worked when I removed the /g flag, but I'm using UiPath Studio's RegEx extractor and I don't know how to remove that flag in that program.

lcvalves

2 Answers


You could use a negative lookahead to match all the lines that do not start with 4 word characters followed by -Processado por programa.

When you reach the first line that does, capture its first 4 word characters in group 1.

\A.*(?:\r?\n(?!\w{4}-Processado\spor\sprograma\b).*)*\r?\n(\w{4})

Explanation

  • \A.* Assert the position at the start of the string, then match any char except a newline 0+ times
  • (?: Non capture group
    • \r?\n Match a newline
    • (?!\w{4}-Processado\spor\sprograma\b) Negative lookahead, assert that what is directly to the right is not 4 word characters followed by -Processado por programa
    • .* Match the rest of the line
  • )* Close the non capture group and repeat it 0+ times to match all the lines
  • \r?\n(\w{4}) Match a newline and capture 4 word characters in group 1
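
To see why this yields a single match even when the engine searches globally, here is a minimal sketch in Python rather than UiPath. The pattern is the one above, unchanged; the shortened sample text and the names PATTERN, PAGE and SAMPLE_TEXT are assumptions made up for the illustration. Because \A can only match at the very start of the string, no second match can begin further down.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(
    r"\A.*(?:\r?\n(?!\w{4}-Processado\spor\sprograma\b).*)*\r?\n(\w{4})"
)

# Shortened stand-in for the OCR output: the same page repeated three times.
PAGE = [
    "Total do Documento (EUR)    615,00",
    "23%      500,00       115,00",
    "b5El-Processado por programa certificado n.º75/AT.",
]
SAMPLE_TEXT = "\n".join(PAGE * 3)

# findall scans the whole string, much like a global (/g) search would.
# With one capture group it returns the group 1 text of every match.
codes = PATTERN.findall(SAMPLE_TEXT)
print(codes)  # ['b5El']
```

Note that the pattern expects the code line to come after at least one other line, which holds for the text in the question.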


The fourth bird
  • What if there's text in front of the 4 char code? I have a different PDF text extraction that has 2 duplicate lines of `Software PHC - YDWL-Processado por programa ` and the goal is to extract the YDWL. – lcvalves Jul 20 '20 at 15:00
  • @LyZ4RD Then you might do it like this https://regex101.com/r/5jJ7Dn/1 – The fourth bird Jul 20 '20 at 15:04
  • You're the real MVP, thanks again. I might need to hit you up a few more times in the future. – lcvalves Jul 20 '20 at 15:09
  • Hey, can you tell me what the regex would look like if the 4 character code also had other characters such as '/', '*', '.', etc.? I have an invoice with the code `CpE/` in it, and I don't know whether the invoice generator may create codes with characters other than letters and numbers. – lcvalves Jul 25 '20 at 16:08
  • @LyZ4RD You could add the characters that you want to allow https://regex101.com/r/0jL9nb/1 or you could match non whitespace chars excluding the hyphen https://regex101.com/r/Tc4Xkn/1 – The fourth bird Jul 25 '20 at 16:12
  • I'll go with the second one because these codes may also have numbers on them, feels like that's the right approach. Thanks once again! – lcvalves Jul 25 '20 at 16:15
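
The two follow-up cases from the comments (text in front of the code on the same line, and codes containing characters such as '/') can be sketched in Python as below. This is not the pattern from the linked regex101 demos, whose contents are not reproduced here; it is an assumed variant that combines the \A anchor with the "non whitespace chars excluding the hyphen" idea, and FIRST_CODE and first_code are hypothetical names.

```python
import re
from typing import Optional

# Assumed variant, not taken from the linked demos:
#   \A          anchor at the start of the string, so only one match is possible
#   [\s\S]*?    lazily skip any characters, newlines included
#   ([^\s-]{4}) capture 4 chars that are neither whitespace nor a hyphen
FIRST_CODE = re.compile(r"\A[\s\S]*?([^\s-]{4})-Processado\spor\sprograma\b")


def first_code(text: str) -> Optional[str]:
    """Return the first 4-char code before '-Processado por programa', or None."""
    match = FIRST_CODE.search(text)
    return match.group(1) if match else None


print(first_code("Software PHC - YDWL-Processado por programa \n" * 2))  # YDWL
print(first_code("Software PHC - CpE/-Processado por programa \n" * 2))  # CpE/
```

Since a hyphen is excluded from the character class, a code that itself contains '-' would not be captured by this sketch.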
/(\w{4})-Processado/g

is what you are looking for. See the image for the regex tester. It works as intended on exactly 4 chars. If you need help applying it in UiPath, let me know.

[Image: regex tester screenshot]

kwoxer
  • That outputs 3 matches and it includes the "-Processado" string. I just want the 4 char code. – lcvalves Jul 20 '20 at 14:34
  • Not really. Just take the first group and all is fine. And it does not include the "-Processado" string. Have a try before saying wrong things please. – kwoxer Jul 20 '20 at 15:20
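
The disagreement in these comments comes down to full match versus capture group, which a short Python sketch can make concrete (Python is used only for illustration; the sample text is a shortened assumption). The pattern still matches three times and each full match includes "-Processado", but group 1 of the first match holds only the code.

```python
import re

# Shortened stand-in for the OCR text: the footer line repeated three times.
text = "\n".join(["b5El-Processado por programa certificado n.º75/AT."] * 3)

pattern = re.compile(r"(\w{4})-Processado")

# A global search finds one match per repetition.
all_matches = list(pattern.finditer(text))
print(len(all_matches))  # 3

# Taking only the first match and reading group 1 gives just the code.
first = all_matches[0]
full_match = first.group(0)  # includes the literal "-Processado"
code = first.group(1)        # the 4-character code only
print(full_match, code)
```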