4

Given the following list of prices, I am trying to figure out how to normalize/extract only the digits.

INPUT          DESIRED_OUTPUT

CA$1399.00     1399.00
$1399.11   1399.11
$1,399.22<     1399.22
Z$1 399.33     1399.33
$1399.44#      1399.44
C$ 1399.55     1399.55
1,399.66       1399.66
1399.77        1399.77
,1399.88       1399.88
25 1399.88     1399.88
399.99          399.99
88.88 99.99      99.99 (if >2 matches on one line, only the last one matters)
.1399.88         DO NOT MATCH (not a price; too many ".")
666.000          DO NOT MATCH (not a price: too many 0's)

I suppose it is a good idea to begin is with what they all have in common:

  • Prices always contain .NN, but never contain .NNN

Upon further inspection, other rules become apparent:

  • .NN must be preceded by one or more digits.
  • NNN.NN can be preceded by either ,, , or a simple digit, but nothing else.
  • Anything following .NN and preceding *N.NN marks the end of the match.
  • Finally, the regex needs to consider commas in things like 1,399.66 (1399.66) to determine whether it is a price, but then strip them. 1, 399.66, for instance does not equal 1399.66: it should be 399.66.

I am looking at sed, grep, and awk for a portable and efficient solution. How should I go about approaching this problem?

I found a similar question, but I have no idea how to try the following regex with sed:

^\d+(,\d{1,2})?$

EDIT: Yes, my input format is can be a little weird, because it is the result of the concatenation of scraped pages.

Community
  • 1
  • 1
octosquidopus
  • 3,517
  • 8
  • 35
  • 53
  • 2
    See http://unix.stackexchange.com/a/138937 recommending `grep -o`. Your inputs are in a really awkward format - `Z$1 399.33` should match the number before the space but `25 1399.88` should not match the number before the space? Why - by what rule can that distinction be encoded? What are the rest of the program and data like - can you do a cleanup run or multiple runs through? – TessellatingHeckler Dec 17 '15 at 03:08
  • A space/comma in position `4`, `8`, `12`, etc (RTL). is acceptable, so `1 399.88`, `1 333 399.88`, and `1 133 333 399.88` are OK. Matching `Z$1 399.33` should not be a big deal; the number ends as soon as $ appears (reading RTL, again). – octosquidopus Dec 17 '15 at 03:22
  • Tricky case: `.1399.98` must not be matched, while `1 399.98` corresponds to `1399.98`. But then what about `.1 399.98`. Is the space then significant such that this is matched and the price is `399.98`? I think the requirement should be that the case is resolved by extracting `.1` as a token, where the trailing space terminates the fraction. The next numeric token is `399.98`: good price. – Kaz Dec 17 '15 at 03:51
  • What strictly portable Unix tools are you using to scrape this data? – Kaz Dec 17 '15 at 03:57
  • Good point. Human common sense tells us that `.1 399.98` standing on its own is `1399.98`, but in other contexts, such a permissive rule could lead to false positives, which is why I rejected `.1399.88` in the first place. I'm not sure how to solve this, but `.1 399.98` is rather unlikely anyway. @Kaz curl. I use curl. – octosquidopus Dec 17 '15 at 04:03
  • What I'm getting from this is that the easiest approach is 1. reverse the lines. 2. look for nn.nnn,nnn,...,nn looking for commas or spaces. e.g. `\d\d\.((\d){3}(, ))+\d+` ? It seems much easier to run it backwards. – TessellatingHeckler Dec 17 '15 at 04:21
  • "Human common sense" tells us that your problem isn't strictly well-defined. – tripleee Dec 17 '15 at 04:55

3 Answers3

1

You can use the following shell script:

#/bin/sh
grep -v '\.\d\+\.' | # get rid of lines with multiple dots within the same number
grep -v '\.\d\d\d\+' | # get rid of lines with more than 2 digits after .
sed -e 's/\(.*\.[0-9][0-9]\).*$/\1/' | # remove anything after last .NN
sed -e 's/^.* \([0-9][0-9][0-9][0-9]\)\./\1./' | # "* NNNN." => "NNNN."
sed -e 's/^.* \([0-9][0-9]\)\./\1./' | # "* NN." => "NN."
sed -e 's/^.* \([0-9]\)\./\1./' | # "* N." => "N."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{3,\}\)\./\1\2./g' | # "*,NNN." or "* NNN." => "*NNN."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{6,\}\)\./\1\2./g' | # "*,NNNNNN." or "* NNNNNN." => "*NNNNNN."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{9,\}\)\./\1\2./g' | # "*,NNNNNNNNN." or "* NNNNNNNNN." => "*NNNNNNNNN."
grep -o '\d\+\.\d\d' # print only the price

In case of numbers that are separated by space or , in groups of 3 digits, this solution works up to 9 digits before the .. If you need to extract bigger prices, just add more lines, increasing the number in the regex by 3. ;-)

Put it in a file called extract_prices, make it executable (chmod +x extract_prices) and run it: ./extract_prices < my_list.txt

Tested on OS X using the following input:

CA$1399.00
&#36;1399.11
$1,399.22<
Z$1 399.33
Z$12 777 666.34   # <-- additonal monster price
$1399.44#
C$ 1399.55
1,399.66
1399.77
,1399.88
25 1399.88
399.99
88.88 99.99
.1399.88
666.000

Which generates the following output:

1399.00
1399.11
1399.22
1399.33
12777666.34
1399.44
1399.55
1399.66
1399.77
1399.88
1399.88
399.99
99.99
Patrik
  • 95
  • 9
  • Instead of removing trailing `#` and `<` explicitly, couldn't you simply strip anything that follows the last instance of exactly two numbers preceded by a dot (`.NN`)? – octosquidopus Dec 17 '15 at 03:29
  • True. That part isn't that hard. – Patrik Dec 17 '15 at 03:48
  • Welcome to Stack Overflow, BTW :) – octosquidopus Dec 17 '15 at 03:54
  • Okay, got a solution! Lol. Hacky, but it works. And thanks! Finally trying to get some reputation... lol – Patrik Dec 17 '15 at 04:14
  • How come `echo "$1,399.22" | extract_prices` produces an incorrect `399.22` , while `echo '$1,399.22' | extract_prices` produces a correct `1399.22`? – octosquidopus Dec 20 '15 at 14:53
  • @octosquidopus that's because the shell performs variable expansion in double quoted strings. $1 would be the first argument given to a script or function. in this case, no argument given, so I guess it'll transform the string into `",399.22"` before `extract_prices` even reads it. – Patrik Dec 21 '15 at 16:22
0

A solution with awk that splits on all characters that are not numbers or decimal point and prints the last field that matches a price. The leading sed script handles the exception case #3 where we have a space instead of a comma marking the thousands spot.

sed -e 's/  / x /g; :a; s/\(\$[1-9][0-9]*\) /\1/; ta' | awk -F '[^0-9.]' -v p='[0-9]+\\.[0-9][0-9]' '$0 ~ p { gsub(/,/, ""); for (i=NF; i>0; i--) if ($i ~ "^" p "$") { print $i; next } }'

Notes:

1) The sed script uses a test to iterate; therefore, it can handle millions, billions, etc.
2) The sed script also handles the multiple space condition such that $1[ ][ ]1000.00 does not become $11000.00 in the end.
3) Commas are simply stripped/ignored... if there is an issue with comma separation of numbers, the issue can be resolved by getting rid of the gsub in the awk script and fixing the filter in the leading sed script

Here is a more complicated version that builds on the idea in note #3 to make commas and spaces part of the number only if the space or comma is at a thousands separator.

sed -e ':a; s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' -v p='[0-9]+\\.[0-9][0-9]' '$0 ~ p { for (i=NF; i>0; i--) if ($i ~ "^" p "$") { print $i; next } }'

If chance of success is high on each line, then getting rid of "p" would make for a more efficient script.

sed -e ':a; s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' '{ for (i=NF; i>0; i--) if ($i ~ /^[0-9]+\.[0-9][0-9]$/) { print $i; next } }'

Finally, for safety, we can check in the sed filter to make sure we have a valid space or comma delimited number before we do either substitution.

sed -e ':a; /\$[1-9][0-9]\?[0-9]\?\( [0-9][0-9][0-9]\)\+\.[0-9][0-9]/ s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; /[1-9][0-9]\?[0-9]\?\(,[0-9][0-9][0-9]\)\+\.[0-9][0-9]/ s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' '{ for (i=NF; i>0; i--) if ($i ~ /^[0-9]+\.[0-9][0-9]$/) { print $i; next } }'
Michael Back
  • 1,821
  • 1
  • 16
  • 17
0

This might work for you (GNU sed):

 sed -r '/\n/!s/([^0-9]*\b(([0-9])[ ,]([0-9]{3})|([0-9]+))(\.[0-9]{2})\b)+/\n\3\4\5\6\n/;/^[0-9]+\.[0-9]{2}\b/P;D' file

This works with the data provided but some of the specification is a bit sketchy.

potong
  • 55,640
  • 6
  • 51
  • 83