Grab required field values from a paragraph block using regex in Python

Question

I have a text file from which I extracted the paragraph blocks below.

Text Example:

EXONERAR, com validade a contar de 19 de agosto de 2020, DE- NILSON DE BRITO LIMA, ID FUNCIONAL Nº 2100423-4, do cargo em comissão de Coordenador, símbolo DAS-8, da Coordenadoria de Gestão Centralizada de Serviços, da Superintendência de Gestão Centralizada, da Subsecretaria de Logística, da Secretaria de Estado de Planejamento e Gestão. Processo nº SEI-120001/010643/2020

EXONERAR, a pedido, NADIA NAKAMURA VIEIRA, ID FUNCIONAL Nº 5099589-8, do cargo em comissão de Assessor Especial, símbolo DG, da Secretaria de Estado de Planejamento e Gestão. Processo nº SEI-150001/004627/2020

EXONERAR, com validade a contar de 26 de novembro de 2020, BRUNO RAFAEL ROCHA COSTA, ID FUNCIONAL Nº 5108093-1, do cargo em comissão de Assessor, símbolo DAS-7, da Assessoria de Planejamento e Gestão, da Presidência, da Superintendência de Des- portos do Estado do Rio de Janeiro - SUDERJ, da Secretaria de Es- tado de Esporte, Lazer e Juventude. Processo nº SEI- 3 0 0 0 0 2 / 0 0 0 4 11 / 2 0 2 0 .

EXONERAR, com validade a contar de 16 de novembro de 2020, LUIS HENRIQUE FERREIRA DE AQUINO, ID FUNCIONAL Nº 1914315-0, do cargo em comissão de Assistente II, símbolo DAI-6, da Secretaria de Estado de Planejamento e Gestão. Processo nº SEI120001/014825/2020:

From the text blocks above, I want to grab only the bold values from each paragraph, with one row per paragraph.

What I have tried:

r"\b(?:(?:EXONERAR|d[ae]|por|símbolo)\s([^,]+?)(?: e Gestão)?,|\b(?!SEI\b)([A-Z\d]+-\s*\d+)|SEI-\s*([\d /]+)\b)"

My Current Output:

https://regex101.com/r/FCimoW/1

My current output is almost right, but the pattern does not match all the required parts, e.g. the CAPITALIZED name part.

Perhaps like this? https://regex101.com/r/gpbqU9/1 – The fourth bird Dec 01 '20 at 17:03

score 2 · Accepted Answer · answered Dec 01 '20 at 17:09

For the bold uppercase parts, you can add an alternation, matching 1 or more uppercase words separated by a whitespace char or a hyphen and that end with a comma.

\b([A-Z]+(?:[\s-]+[A-Z]+)+)(?=,)

Regex demo for the full pattern
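As a rough sketch of how the name alternative above behaves on its own in Python, the snippet below runs it over two paragraphs shortened from the question's text example. The names `NAME_RE`, `sample_paragraphs` and `extract_names` are only illustrative, and the capture group is assumed to close before the lookahead.

```python
import re

# Name alternative from the answer: two or more uppercase words,
# separated by whitespace or hyphens, directly followed by a comma.
# Assumption: the capture group closes before the lookahead.
NAME_RE = re.compile(r"\b([A-Z]+(?:[\s-]+[A-Z]+)+)(?=,)")

# Two paragraphs from the question's text example, shortened.
sample_paragraphs = [
    "EXONERAR, com validade a contar de 19 de agosto de 2020, "
    "DE- NILSON DE BRITO LIMA, ID FUNCIONAL Nº 2100423-4, "
    "do cargo em comissão de Coordenador, símbolo DAS-8",
    "EXONERAR, a pedido, NADIA NAKAMURA VIEIRA, "
    "ID FUNCIONAL Nº 5099589-8, "
    "do cargo em comissão de Assessor Especial, símbolo DG",
]


def extract_names(text):
    """Return every uppercase multi-word run that is followed by a comma."""
    return NAME_RE.findall(text)


for paragraph in sample_paragraphs:
    print(extract_names(paragraph))
```

Single uppercase tokens such as `EXONERAR` or `DG` are skipped because the pattern requires at least two words. The hyphen left by the line break in `DE- NILSON` is kept in the match, so it would still need cleaning afterwards.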

The fourth bird

154,723
16
55
70

`[A-Z]+` captures the CAPITALIZED name but not international characters. See: https://regex101.com/r/wqAaSg/1 – A l w a y s S u n n y Dec 02 '20 at 15:04
@AlwaysSunny Try it like this using `\p{Lu}` https://regex101.com/r/7iNy7o/1 – The fourth bird Dec 02 '20 at 15:06
Maybe it is not valid in Python; I am getting the error `sre_constants.error: bad escape \p at position 113` – A l w a y s S u n n y Dec 02 '20 at 15:08
I added a `\` before it, like `\\p`, but now it causes a problem with capturing in Python – A l w a y s S u n n y Dec 02 '20 at 15:10
I am sorry, that indeed does not work in Python's built-in `re` module. It does work when you install the PyPI regex module. You can check [this page](https://stackoverflow.com/questions/36187349/python-regex-for-unicode-capitalized-words) to match uppercase chars in Python. Or you can add the specific ones that you want to match to the character class. – The fourth bird Dec 02 '20 at 15:14
Please don't be. I've installed that **regex** package and it is working now. Thanks for the link. I had seen that link earlier but was not sure whether it would work for me. When you advised it, I used it, and it works as per my requirements. Thanks a million :) – A l w a y s S u n n y Dec 02 '20 at 16:44
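For anyone who wants to stay with the built-in `re` module, the other option mentioned in the comments (adding the specific characters to the character class) could look roughly like the sketch below. The range `À-ÖØ-Þ` is an assumption that covers only the Latin-1 uppercase letters, and the sample sentence uses a made-up name and ID purely for illustration.

```python
import re

# Assumption: ASCII uppercase plus the Latin-1 uppercase letters
# (À-Ö and Ø-Þ; the multiplication sign × sits between them and is skipped).
UPPER = "[A-ZÀ-ÖØ-Þ]"

ASCII_NAME_RE = re.compile(r"\b([A-Z]+(?:[\s-]+[A-Z]+)+)(?=,)")
UNICODE_NAME_RE = re.compile(rf"\b({UPPER}+(?:[\s-]+{UPPER}+)+)(?=,)")

# Hypothetical sample, not taken from the question's data.
sample = "EXONERAR, a pedido, JOÃO ANTÔNIO CONCEIÇÃO, ID FUNCIONAL Nº 0000000-0, do cargo"

print(ASCII_NAME_RE.findall(sample))    # ASCII-only class misses the accented name
print(UNICODE_NAME_RE.findall(sample))  # extended class picks it up

# With the third-party PyPI `regex` module, \p{Lu} can replace UPPER:
#   import regex
#   regex.findall(r"\b(\p{Lu}+(?:[\s-]+\p{Lu}+)+)(?=,)", sample)
```

The explicit range keeps the script free of extra dependencies, but it has to be extended by hand for uppercase letters outside Latin-1, which is where `\p{Lu}` from the `regex` package is more convenient.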
