
I am trying to write a regex that extracts legislation references in Brazilian Portuguese (pt-br), such as:

  • Lei 11.738/2008
  • Lei nº 9.394/96
  • Lei Estadual nº 6.834
  • Lei 5.539/2009
  • Decreto 30.825/2002
  • Lei 1614 de 21 de janeiro de 1990
  • art. 1.039 do CPC/2015

P.S.: "lei" means law, "decreto" means decree, and "CPC" is the civil procedure code.

My current attempt is:

(?<LEGISLACAO>(art(\.|igos?)\s[\d\.º]+(,\s*?(caput|§),\s*?)?\s+?d[oa]\s+?)*?((lei(\s(estadual|nacional|federal))??|decreto|N?CPC)\s*?(n[º\.])*?\s*?[\d\.\/º]+)(\s*?de\s*?\d{1,2}\s*?de\s*(janeiro|fevereiro|março|abril|maio|junho|julho|agosto|setembro|outubro|novembro|dezembro)\s*?de\s*?\d{2,4})?)

Regex101: https://regex101.com/r/69ggnm/1

But this regex still has some flaws. It captures some undesired strings, such as:

  • lei nº.
  • lei.
  • CPC.

It also captures the trailing period at the end of some citations, such as:

  • LEI FEDERAL Nº 11.738/2008.

And it does not match these:

  • lei nº. 1.060/50
  • art.334, § 5º do CPC
  • artigo 85, §11º, do CPC

What regex would avoid those problems while still matching the correct results?

celsowm
  • What if you just add `\b` at the end to only match before a word boundary? – Wiktor Stribiżew Aug 16 '23 at 13:29
  • If that is not what is expected, try to require a digit to appear at the end: `(?<LEGISLACAO>(art(\.|igos?)\s[\d.º]+(,\s*?caput,\s*?)?\s+?d[oa]\s+?)*?((lei(\s(estadual|nacional|federal))??|decreto|N?CPC)\s*?(n[º\.])*?\s*?\d(?:[\d.\/º]*\d)?)(\s*?de\s*?\d{1,2}\s*de\s*(janeiro|fevereiro|março|abril|maio|junho|julho|agosto|setembro|outubro|novembro|dezembro)\s*?de\s*\d{2,4})?)`? – Wiktor Stribiżew Aug 16 '23 at 13:36
  • @WiktorStribiżew I updated with two more cases – celsowm Aug 16 '23 at 16:33
  • In order to write a regular expression, you must first express in English what the rules are that you're trying to match. The example are illustrative, but you have to define what the rules are. So what are the rules you want to use? – Andy Lester Aug 17 '23 at 14:41
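The "require a digit at the end" idea from the comments can be checked in isolation. The following is a minimal sketch in Python's `re` module, not the asker's full expression: `LOOSE` and `STRICT` are hypothetical names, and both patterns are deliberately cut down to the "lei/decreto + number" part so the effect of the number sub-pattern is easy to see.

```python
import re

# Number part as in the question: any run of digits, dots, slashes or "º".
# A lone "." satisfies it, which is why "lei." and "lei nº." are captured.
LOOSE = re.compile(r"(?:lei|decreto)\s*(?:n[º.])*\s*[\d./º]+", re.IGNORECASE)

# Number part as suggested in the comment: it must start with a digit and,
# if it is longer than one character, it must also end with a digit.
STRICT = re.compile(r"(?:lei|decreto)\s*(?:n[º.])*\s*\d(?:[\d./º]*\d)?", re.IGNORECASE)


def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    found = pattern.search(text)
    return found.group(0) if found else None


if __name__ == "__main__":
    for sample in ["lei.", "lei nº.", "LEI Nº 11.738/2008."]:
        print(repr(sample), "loose:", first_match(LOOSE, sample),
              "strict:", first_match(STRICT, sample))
```

With the stricter number part, the bare "lei." and "lei nº." strings are no longer matched and the trailing period is left out of "LEI Nº 11.738/2008.", because the engine backtracks until the last consumed character is a digit.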

1 Answer


Here's a suggestion tested with grep -P.

# Earlier versions of PATT1, superseded by the edits listed under "Update":
# PATT1='(lei|decreto)(( estadual| federal)? nº.?)? \d+.\d{3}(/\d{4}|/\d{2})?'
# PATT1='(lei|decreto) ((estadual |federal )?n(.|º.?) )?\d+.\d{3}(/\d{4}|/\d{2})?'

PATT1='(lei|decreto) (estadual |federal )?(n\. |nº\.? )?\d+(\.\d+)?(/\d+)?'
PATT2='lei \d+ de \d+ de (janeiro|fevereiro|março|abril|maio|junho|julho|agosto|setembro|outubro|novembro|dezembro) de \d{4}'
PATT3='art(igo |\.)\d+, § ?\d+º,? do CPC'
PATT4='art\. \d+\.\d+ do CPC/\d{4}'

If "INPUTFILE" contains the following:

 1  Lei 11.738/2008
 2  Lei nº 9.394/96
 3  Lei Estadual nº 6.834
 4  Lei estadual 5.539/09
 5  Lei 5.539/2009
 6  LEI FEDERAL Nº 11.738/2008.
 7  lei nº. 1.060/50
 8  Lei n. 11.738/2008
 9  Lei n. 94/1947
10  Lei 1614 de 21 de janeiro de 1990
11  Decreto 30.825/2002
12  art. 1.039 do CPC/2015
13  art.334, § 5º do CPC
14  artigo 85, §11º, do CPC

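... then grep -P -o -i -e "\b($PATT1|$PATT2|$PATT3|$PATT4)\b" "INPUTFILE" seems to match every target expression. One caveat: PCRE tries alternatives from left to right, so for "Lei 1614 de 21 de janeiro de 1990" PATT1 succeeds first and only "Lei 1614" is printed. Listing $PATT2 before $PATT1 returns the full dated form.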

Would that meet your needs?
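For anyone who wants to try the same patterns outside the shell, here is a rough port to Python's `re` module. It is a sketch under a few assumptions: the function name `extract_references` is made up for illustration, `re.IGNORECASE` stands in for `grep -i`, and PATT2 is listed first in the alternation because regex alternation is ordered and PATT1 would otherwise stop at "Lei 1614".

```python
import re

MONTHS = ("janeiro|fevereiro|março|abril|maio|junho|julho|agosto|"
          "setembro|outubro|novembro|dezembro")

# The four patterns from the answer, copied as-is (final version of PATT1).
PATT1 = r"(lei|decreto) (estadual |federal )?(n\. |nº\.? )?\d+(\.\d+)?(/\d+)?"
PATT2 = r"lei \d+ de \d+ de (" + MONTHS + r") de \d{4}"
PATT3 = r"art(igo |\.)\d+, § ?\d+º,? do CPC"
PATT4 = r"art\. \d+\.\d+ do CPC/\d{4}"

# PATT2 goes first: alternatives are tried left to right, and PATT1 would
# otherwise match just "Lei 1614" in a dated reference.
COMBINED = re.compile(
    r"\b(?:" + "|".join([PATT2, PATT1, PATT3, PATT4]) + r")\b",
    re.IGNORECASE,
)


def extract_references(text):
    """Return every legislation reference found in text, in order."""
    return [m.group(0) for m in COMBINED.finditer(text)]


if __name__ == "__main__":
    sample = ("Conforme a LEI FEDERAL Nº 11.738/2008. Ver também a "
              "Lei 1614 de 21 de janeiro de 1990 e o art.334, § 5º do CPC.")
    for reference in extract_references(sample):
        print(reference)
```

Because every alternative has to end in a digit or in "CPC" before the closing `\b`, a trailing period is never part of a match, and bare strings such as "lei." or "CPC." produce no match at all.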

Update:

Edited "PATT1" in order to capture "Lei n. 11.738/2008"

Edited "PATT1" in order to capture "Lei estadual 5.539/09" and "Lei n. 94/1947"

Grobu