Regex for receipt items

Question

I have a simple receipt which I typed out. I need to be able to read the items purchased on the receipt. The sample receipt is below.

               Tim Hortons
              Alwasy Fresh

1   Brek Wrap Combo /A          ($0.76)
1   Bacon-wrap                  $3.79
1   Grilled                     $0.00
1   5 Pieces Bacon-wrap         $0.00
1   Orange                      $1.40
1   Deposit                     $0.10
Subtotal:                       $55.84
GST:                $0.29
Debit:                          $55.84
Take out

         Thanks for stopping by!!
           Tell us how we did

I came up with the following regex string to find the items.

\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}

It works for the most part but there are a few incorrect lines like

4
GST:                $0.29

Can someone come up with a better pattern? Below is a link to see it in action.

http://regexr.com/3cnk9

score 1 · Answer 1 · answered Feb 03 '16 at 22:21

Here's my attempt:

^(\d+)\s+(.*)\s+\(?(\$.+)\)?$

Remember to turn the multiline option on. Components:

^         - beginning of line
(\d+)     - capture the quantity at the beginning of each line item
\s+       - one or more whitespace characters
(.*)      - capture the item description
\s+       - one or more whitespace characters
\(?       - optional open bracket `(` character
(\$.+)    - capture the dollar sign and everything after it
\)?       - optional close bracket `)` character
$         - end of line
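
As a rough check of the components above, here is a minimal JavaScript sketch. The sample line and the variable names are mine, not part of the answer. It suggests two things to watch for: the greedy (.*) keeps the padding to the right of the description, and the greedy .+ in the price group swallows the closing bracket before \)? gets a chance to match it.

```javascript
// Pattern from the answer above, with the multiline flag so ^ and $ work per line.
const attempt = /^(\d+)\s+(.*)\s+\(?(\$.+)\)?$/m;

// Hypothetical sample line modelled on the receipt in the question.
const sampleLine = "1   Brek Wrap Combo /A     ($0.76)";

const [, qty, description, amount] = attempt.exec(sampleLine);

console.log(JSON.stringify(qty));         // the quantity
console.log(JSON.stringify(description)); // the description, still padded on the right
console.log(JSON.stringify(amount));      // the price text, including the ")"
```

Trimming the description and stripping the bracket afterwards would work around both, as would a stricter price group.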

Adam Katz · Accepted Answer · 2016-02-08T21:00:35.387

I see a number of problems with this original regex:

\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}

First, parentheses both group and capture, but when you quantify a group, only its last iteration is captured, so (.)* stores only the last character; you wanted (.*) instead. Since it's greedy, that character is the one before the space preceding the dollar sign, which given your data will always be a space. Similarly, (\s){1,10} at the beginning captures only the last whitespace character. You don't need that group at all, since \s already matches a single whitespace character, so you can simply use \s{1,10}.
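
The difference between quantifying a group and quantifying inside a group can be seen in a small JavaScript sketch. The sample line and the variable names are illustrative assumptions, not taken from the question.

```javascript
// Hypothetical sample line modelled on the receipt in the question.
const orangeLine = "1   Orange          $1.40";

// Original pattern: the quantifiers sit outside the capturing groups.
const quantifiedGroup = /\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}/;
// Same pattern with the first group dropped and the quantifier moved inside the second.
const groupedQuantifier = /\d\s{1,10}(.*)\s{1,}\$\d\.[0-9]{2}/;

const before = quantifiedGroup.exec(orangeLine);
const after = groupedQuantifier.exec(orangeLine);

console.log(JSON.stringify(before[2])); // only the last character matched by (.)
console.log(JSON.stringify(after[1]));  // the whole description, padding included
```

With (.)* the second group ends up holding a single space; with (.*) it holds the description plus its trailing padding, which is why the solution below ends the group with \S.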

Here is a piece-by-piece explanation of what that regular expression does.
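
The stray match reported in the question (a lone 4 followed by the GST line) can be reproduced with a short JavaScript sketch. The two-line excerpt is a hypothetical stand-in for the receipt's totals. Because the pattern is not anchored and \s also matches a line break, the match appears to start at the final 4 of $55.84 and continue onto the next line.

```javascript
// Hypothetical two-line excerpt modelled on the receipt's totals.
const totals = "Subtotal:      $55.84\nGST:      $0.29";

// The question's original pattern, unanchored.
const original = /\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}/;

const stray = original.exec(totals);

// The match begins at the "4" of "$55.84" and runs across the line break.
console.log(JSON.stringify(stray[0]));
```

Anchoring the pattern with ^ under the multi-line flag, as the solutions below do, prevents a match from starting in the middle of a line.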

Capturing solution

The following regex captures the quantity ($1), item description ($2), whether the price is parenthesized ($3), and the price ($4):

^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$

Explained and matched to your sample at regex101.

Separated out and commented (assumes the /x flag is supported):

/          # begin regex
 ^\s*      # start of line, ignore leading spaces if present
 (\d+)     # $1 = quantity
 \s+       # spacing as a delimiter
 (.*\S)    # $2 = item: contains anything, must end in a non-space char
 \s+       # spacing as a delimiter
 (\(?)     # $3 = negation, an optional open parenthesis
 \$        # dollar sign
 ([0-9.]+) # $4 = price
 \)?\s*$   # trailing characters: optional end-paren and space(s)
/x         # end regex; /x enables extended (free-spacing) mode

with sample perl code executed from a command line:

perl -ne '
  my ($quantity, $item, $neg, $price)
    = /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/;
  if ($item) {
    if ($neg) { $price *= -1; }
    print "<$quantity><$item><$price>\n"
  }' RECEIPT_FILE

(If you want that as a perl script, wrap the code with while(<>) { } and you're done.)

This assigns the variables $quantity, $item, and $price to the itemized lines on your receipt. I am assuming that a parenthesized item is to be subtracted (but I can't verify that since the totals are nonsensical), so $neg notes the existence of a parenthesis so the $price can be negated.

I set the output to use angle brackets (< and >) to indicate what each variable stores.

The output of your given sample receipt would therefore be:

<1><Brek Wrap Combo /A><-0.76>
<1><Bacon-wrap><3.79>
<1><Grilled><0.00>
<1><5 Pieces Bacon-wrap><0.00>
<1><Orange><1.40>
<1><Deposit><0.10>

Prices only solution

You didn't say what you wanted to match. If you only care about the prices and there are no negative values, you don't need capture groups if you have look-behind or \K:

grep -Po '^\s*[0-9].*\$\K[0-9.]+' RECEIPT_FILE

Grep's -P flag invokes libpcre (which may not be available if you're on an old or embedded system) and -o displays only the matching text. \K resets the start of the reported match, so everything before it must match but is left out of the output. Put the \$ after the \K if you want it included in the output. (See also the regex101 description and matches.)

Output from that grep command:

0.76
3.79
0.00
0.00
1.40
0.10
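
JavaScript has no \K, but engines that implement ES2018 look-behind can express roughly the same idea. The following is a minimal sketch under that assumption; the receipt excerpt and the variable names are illustrative.

```javascript
// Hypothetical excerpt of the receipt from the question.
const receiptText = [
  "1   Brek Wrap Combo /A          ($0.76)",
  "1   Bacon-wrap                  $3.79",
  "Subtotal:                       $55.84",
  "GST:                $0.29",
].join("\n");

// The look-behind stands in for \K: it requires "digit ... $" earlier on
// the same line, but only the amount itself is reported as the match.
const priceOnly = /(?<=^\s*[0-9].*\$)[0-9.]+/gm;

const prices = receiptText.match(priceOnly);
console.log(prices);
```

As with the grep version, lines that do not start with a quantity (Subtotal, GST) are skipped, and the parentheses around a negative amount are ignored.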

Prices only – with awk

There aren't great ways to make this regex efficient. If you're processing a mountain of content, you'll feel the hurt. Here's a solution using awk that should be significantly faster. (The difference won't be noticeable with a small input.)

awk '$1 / 1 > 0 && $NF ~ /\$/ { gsub(/[()]/, "", $NF); print $NF; }' RECEIPT_FILE

Commented version with explanation:

awk '
  # if the quantity is indeed a number and the last field has a dollar sign
  $1 / 1 > 0 && $NF ~ /\$/ {
    gsub(/[()]/, "", $NF);   # remove all parentheses from the last field
    print $NF;               # print the contents of the last field
  }' RECEIPT_FILE

Prices only – with awk, supporting negative prices

awk '
  # if the quantity is indeed a number and the last field has a dollar sign
  $1 / 1 > 0 && $NF ~ /\$/ {
    neg = 1;
    if ( $NF ~ /\(/ ) {      # the last field has an open parenthesis
      neg = -1;
    }
    gsub(/[()$]/, "", $NF);  # strip parentheses and the dollar sign so the field is numeric
    print $NF * neg;         # print the last field, negated if parenthesized
  }' RECEIPT_FILE
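
For readers working in JavaScript rather than awk, the same field-based idea might look like the sketch below. The function name signedPrices is hypothetical, and the numeric check only roughly mirrors awk's (awk would accept a first field such as 1abc, while Number() would not).

```javascript
// Hypothetical helper: returns signed prices for itemized lines,
// splitting each line on whitespace much like awk does.
function signedPrices(text) {
  const result = [];
  for (const line of text.split("\n")) {
    const fields = line.trim().split(/\s+/);
    const first = fields[0];
    const last = fields[fields.length - 1];
    // Mirror the awk condition: numeric quantity first, dollar sign in the last field.
    if (!(Number(first) > 0) || !last.includes("$")) continue;
    const sign = last.includes("(") ? -1 : 1;
    // Strip parentheses and the dollar sign so the rest parses as a number.
    result.push(sign * Number(last.replace(/[()$]/g, "")));
  }
  return result;
}

console.log(signedPrices("1   Brek Wrap Combo /A   ($0.76)\n1   Orange   $1.40\nGST:   $0.29"));
```

No regex is applied to the line as a whole here; only the first and last fields are inspected.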

Wiktor Stribiżew · Answer 3 · 2016-02-03T23:16:25.817

You can use

^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)

See the regex demo

This regex should be used with the /m modifier to match data on different lines. In JS, the /g modifier is also required.

Explanation:

^ - start of a line
(\d+) - Group 1 capturing one or more digits
\s+ - one or more whitespaces
(.*?) - Group 2 capturing zero or more characters other than a newline, as few as possible, up to the first run of whitespace that is followed by the price
\s+ - one or more whitespaces
\(? - an optional ( (on the first line)
\$ - a literal $
(\d+\.\d+) - Group 3 capturing one or more digits followed with . and one or more digits.

JS demo:

var re = /^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)/gm; 
var str = '               Tim Hortons\n              Alwasy Fresh\n\n1   Brek Wrap Combo /A          ($0.76)\n1   Bacon-wrap                  $3.79\n1   Grilled                     $0.00\n1   5 Pieces Bacon-wrap         $0.00\n1   Orange                      $1.40\n1   Deposit                     $0.10\nSubtotal:                       $55.84\nGST:                $0.29\nDebit:                          $55.84\nTake out\n\n         Thanks for stopping by!!\n           Tell us how we did';

var m; // declare the match variable instead of leaking an implicit global
while ((m = re.exec(str)) !== null) {
    document.body.innerHTML += "Pcs: <b>" + m[1] + "</b>, item: <b>" + m[2] + "</b>, paid: <b>" + m[3] + "</b><br/>";
}

It would involve redundant overhead. There is just no need matching more than we need. — Wiktor Stribiżew, Feb 03 '16 at 22:25

score 0 · Answer 4 · answered Oct 20 '21 at 01:48

Adam Katz's answer should be the accepted one! I used this variation of his answer for an implementation in JavaScript:

const receiptRegex = /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/gm
let items = [];
const matches = inputStr.matchAll(receiptRegex);
for (const matchedGroup of matches) {
  const [
    fullString,    //[0] -> matched string "1 Blue gatorade $2.00"
    quantity,      //[1] -> quantity "1"
    item,          //[2] -> item description "Blue gatorade"
    openParen,     //[3] -> "(" if the price was parenthesized, else "" (ignored here)
    price          //[4] -> amount "2.00"
  ] = matchedGroup;

  items.push({
    quantity,
    item,
    price,
  });
}
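
If the parenthesized amounts matter, the third capture group can be used instead of ignored. The following is a sketch only: parseReceipt and its field names are hypothetical, and it carries over the accepted answer's assumption that parentheses mean a negative amount.

```javascript
// Hypothetical helper built on the same regex; names are illustrative.
function parseReceipt(inputStr) {
  const receiptRegex = /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/gm;
  const items = [];
  for (const [, quantity, item, openParen, price] of inputStr.matchAll(receiptRegex)) {
    items.push({
      quantity: Number(quantity),
      item,
      // Assumption: a parenthesized price is a negative amount.
      price: openParen ? -Number(price) : Number(price),
    });
  }
  return items;
}

console.log(parseReceipt("1   Brek Wrap Combo /A     ($0.76)\n1   Orange   $1.40"));
```

Converting to numbers here is a design choice; keeping the strings, as in the answer above, avoids floating-point surprises if the values are only displayed.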
