Given the following list of prices, I am trying to figure out how to normalize/extract only the digits.
INPUT DESIRED_OUTPUT
CA$1399.00 1399.00
$1399.11 1399.11
$1,399.22< 1399.22
Z$1 399.33 1399.33
$1399.44# 1399.44
C$ 1399.55 1399.55
1,399.66 1399.66
1399.77 1399.77
,1399.88 1399.88
25 1399.88 1399.88
399.99 399.99
88.88 99.99 99.99 (if >2 matches on one line, only the last one matters)
.1399.88 DO NOT MATCH (not a price; too many ".")
666.000 DO NOT MATCH (not a price: too many 0's)
I suppose it is a good idea to begin is with what they all have in common:
- Prices always contain
.NN
, but never contain.NNN
Upon further inspection, other rules become apparent:
.NN
must be preceded by one or moredigits
.NNN.NN
can be preceded by either,
,, or a simple
digit
, but nothing else.- Anything following
.NN
and preceding*N.NN
marks the end of the match. - Finally, the regex needs to consider commas in things like
1,399.66
(1399.66
) to determine whether it is a price, but then strip them.1, 399.66
, for instance does not equal1399.66
: it should be399.66
.
I am looking at sed
, grep
, and awk
for a portable and efficient solution. How should I go about approaching this problem?
I found a similar question, but I have no idea how to try the following regex with sed
:
^\d+(,\d{1,2})?$
EDIT: Yes, my input format is can be a little weird, because it is the result of the concatenation of scraped pages.