How can I fix this regex pattern?

Question

I'm trying to create a regex pattern to capture information from my credit card invoice, which I can only get as a PDF. I'm copying the text to a text editor and then I use the replace tool in notepad++ to convert that copied text to a CSV one.

I'm having problem with negative values.

Given this piece of text:

16/04 RC GRACAS 2 - 0,02
SAÚDE .RECIFE

16/04 RC GRACAS 2 02/03 45,97
SAÚDE .RECIFE

The text above contains data for 2 bill data. The regex should capture the following groups in the first entry:

"16/04": date
"RC GRACAS 2": description
"-": value sign
"0,02": value
"SAÚDE .RECIFE": categories

As well as the following groups in the second entry:

"16/04": date
"RC GRACAS 2 02/03": description
"": value sign
"45,97": value
"SAÚDE .RECIFE": categories

The current regex I have is this: ^(\d{2}/\d{2})\s+(.*)\s+([-+]?)\s?(\d{1,3}(?:\.\d{3})*(?:,\d+)?)\s+(.*)?

The problem I'm having is that in the first purchase, the regex can't capture the minus sign, it becomes parte of the second group (description).

How can I change this regex to capture that sign in its own group?

I recommend using https://regex101.com/ to test your regex. This is a great tool that can help you figure out how to get it right, even if it is by trial and error. — Code-Apprentice, May 28 '23 at 20:26
Aside from the problem with the minus sign your regex permits a match if only the first of the two lines is present or if there are blank lines between the two lines. Is that your intention? — Cary Swoveland, May 29 '23 at 19:30

InSync · Answer 1 · 2023-05-28T23:37:07.750

. matches everything, including - and +. You can prevent them from matching those two by changing the second group to ([^-+]*) if your descriptions are guaranteed to have neither hyphens nor pluses:

^
(\d{2}/\d{2})\s+
([^-+]*)\s+
([-+]?)\s?
(\d{1,3}(?:\.\d{3})*(?:,\d+)?)\s+
(.*)?

Try it on regex101.com.

Alternatively, here's my suggestion:

^                                # Match at the start of line
(?<date>\d{2}/\d{2})             # a date,
(?:                              # a description
  \s+(?<description>.*?)         # consists of at least some spaces
)??                              # (optional, lazily matched)
(?:                              # 
  \s+                            # some other spaces
  (?:(?<value_sign>[-+])\s)?     # then a sign and a space, collectively optional,
  (?<value>                      # followed by a value
    \d{1,3}(?:\.\d{3})*(?:,\d+)? # (which is a number)
  )                              # 
)                                # 
\s*$                             # right before the end of line,
\n                               # after which is a new line
(?<categories>.*)                # containing categories.

Try it on regex101.com.

score 1 · Answer 2 · answered May 28 '23 at 20:18

The issue is that the description group is too general (and matches the next group's pattern as well) and the sign group is optional so it gets captured by the description. What really makes it a problem is that you have another .* group that is optional to the right of the sign group.

You can solve this by making two simple changes to your regular expression. The first is to make the description group lazy (by adding a ? after the *. The second is to add an end-line $ to the expression:

^(\d{2}/\d{2})\s+(.*?)\s+([-+]?)\s?(\d{1,3}(?:\.\d{3})*(?:,\d+)?)\s+(.*)?$
                    ^                                                    ^

The change to a lazy group prevents the description field from passing into the next group and the end-line adds more structure to the expression, allowing the laziness to work.

score 1 · Accepted Answer · answered May 28 '23 at 23:31

1

Use reluctant quantifier .*?, which matches as few characters as possible, and use [\r\n] to match the new line:

^(\d\d\/\d\d)\s+(.*?)\s+([-+])?\s?(\d{1,3}(?:\.\d{3})*(?:,\d+)?)[\r\n]+(.*)?

See live demo.

answered May 28 '23 at 23:31

Bohemian

412,405
93
575
722

Cary Swoveland · Answer 4 · 2023-05-29T19:38:42.880

You can use the following regular expression.

(?x)                        # invoke free-spacing mode
^                           # match beginning of line
(?<date>\d{2}\/\d{2})       # match 2 digits, '/', 2 digits and save to capture
                            # group 'date'
[ ]+                        # match 1 or more spaces, as many as possible

(?<description>[\w ]*\w(?:[ ]+\d{2}\/\d{2})?)
                            # match zero or more word chars or spaces, as many as
                            # possible, followed by a word char,  optionally
                            # followed by a one or more spaces, 2 digits, '/', 2 digits,
                            # save to capture group 'description'
[ ]+                        # match 1 or more spaces, as many as possible
(?<value_sign>[-+]|)        # match '-', '+' or an empty space, save to caputure
                            # group 'value_sign' 
[ ]+                        # match 1 or more spaces, as many as possible
(?<value>\d+,\d{2})         # match 1 or more digits, ',', 2 digits, save to capture
                            # group 'value'
\r?\n                       # match a line feed optionally preceded by a carriage
                            # return (for Windows support)
(?<categories>\S.*)         # match a non-whitespace character followed by
                            # zero or more characters other than line
                            # terminators, as many as possible, save to
                            # capture group 'categories'

Demo

If free-spacing mode were not specified this would be

^(?<date>\d{2}\/\d{2}) +(?<description>[\w ]*\w(?: +\d{2}\/\d{2})?) +(?<value_sign>[-+]|) +(?<value>\d+,\d{2})\r?\n(?<categories>\S.*)

If numbered capture groups were used this would be

^(\d{2}\/\d{2}) +([\w ]*\w(?: +\d{2}\/\d{2})?) +([-+]|) +(\d+,\d{2})\r?\n(\S.*)

How can I fix this regex pattern?

4 Answers4