I see a number of problems with this original regex:
\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}
First, parentheses both group and match, though when you quantify your match, only the last iteration is captured, so matching like (.)*
will only store the last character; you wanted (.*)
for that. Since it's greedy, that will be the character before the space preceding a dollar sign, which given your data will always be a space. Similarly, you're quantifying a group at the beginning with (\s){1,10}
, which captures only the last whitespace character. In this case, you don't need the group since \s
is a single space character, so you can simply use \s{1,10}
.
Here is a piece-by-piece explanation of what that regular expression does.
Capturing solution
The following regex captures the quantity ($1), item description ($2), whether the price is parenthesized ($3), and the price ($4):
^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$
Explained and matched to your sample at regex101.
Separated out and commented (assumes the /x flag is supported):
/ # begin regex
^\s* # start of line, ignore leading spaces if present
(\d+) # $1 = quantity
\s+ # spacing as a delimiter
(.*\S) # $2 = item: contains anything, must end in a non-space char
\s+ # spacing as a delimiter
(\(?) # $3 = negation, an optional open parenthesis
\$ # dollar sign
([0-9.]+) # $4 = price
\)?\s*$ # trailing characters: optional end-paren and space(s)
/x # end regex, multi-line regex flag
with sample perl code executed from a command line:
perl -ne '
my ($quantity, $item, $neg, $price)
= /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/;
if ($item) {
if ($neg) { $price *= -1; }
print "<$quantity><$item><$price>\n"
}' RECEIPT_FILE
(If you want that as a perl script, wrap the code with while(<>) { }
and you're done.)
This assigns the variables $quantity, $item, and $price to the itemized lines on your receipt. I am assuming that a parenthesized item is to be subtracted (but I can't verify that since the totals are nonsensical), so $neg notes the existence of a parenthesis so the $price can be negated.
I set the output to use angle brackets (<
and >
) to indicate what each variable stores.
The output of your given sample receipt would therefore be:
<1><Brek Wrap Combo /A><-0.76>
<1><Bacon-wrap><3.79>
<1><Grilled><0.00>
<1><5 Pieces Bacon-wrap><0.00>
<1><Orange><1.40>
<1><Deposit><0.10>
Prices only solution
You didn't say what you wanted to match. If you don't care about anything but the prices and there are no negative values, you don't need matchers if you have negative look-behind or \K
:
grep -Po '^\s*[0-9].*\$\K[0-9.]+' RECEIPT_FILE
Grep's -P
flag invokes libpcre (which may not be available if you're on an old or embedded system) and -o
displays only the matching text. \K
denotes the start of the match. Put the \$
after the \K
if you want to capture it. (See also the regex101 description and matches.)
Output from that grep command:
0.76
3.79
0.00
0.00
1.40
0.10
Prices only – with awk
There aren't great ways to handle this regex with efficiency. If you're processing through a mountain of content, you'll feel the hurt. Here's a solution using awk
that should be significantly faster. (The difference won't be noticeable with a small input.)
awk '$1 / 1 > 0 && $NF ~ /\$/ { gsub(/[()]/, "", $0); print $NF; }' RECEIPT_FILE
Commented version with explanation:
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
gsub(/[()]/, "", $NF); # remove all parentheses from the last field
print $NF; # print the contents of the last field
}' RECEIPT_FILE
Prices only – with awk, supporting negative prices
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
neg = 1;
if ( $NF ~ /\(/ ) { # the last field has an open parenthesis
gsub(/[()]/, "", $NF); # remove all parentheses from the last field
neg = -1;
}
print $NF * neg; # print the last field, negated if parenthesized
}' RECEIPT_FILE