0

I have a dataframe called data, I am trying to clean one of the columns in the dataframe so I can convert the price into numerical values only. This is how I'm filtering for the column to find those incorrect values. data[data['incorrect_price'].astype(str).str.contains('[A-Za-z]')]

    Incorrect_Price    Occurences   errors
23  99 cents                732       1
50  3 dollars and 49 cents  211       1
72  the price is 625        128       3
86  new price is 4.39       19        2
138 4 bucks                 3         1
199 new price 429           13        1
225 price is 9.99           5         1
240 new price is 499        8         2  

I have tried data['incorrect_Price'][20:51].str.findall(r"(\d+) dollars") and data['incorrect_Price'][20:51].str.findall(r"(\d+) cents") to find rows that have "cents" and "dollars" in them so I can extract the dollar and cents amount but haven't been able to incorporate this when iterating over all rows in the dataframe.

  I would like the results to like look this:  

    Incorrect_Price        Desired    Occurences    errors
23  99 cents                .99           732         1
50  3 dollars and 49 cents  3.49          211         1
72  the price is 625        625           128         3
86  new price is 4.39       4.39           19         2
138 4 bucks                 4.00           3          1
199 new price 429           429            13         1
225 price is 9.99           9.99           5          1
240 new price is 499        499            8          2
TH14
  • 622
  • 10
  • 24

1 Answers1

2

The task can be relatively easily solved as long as the strings Incorrect_Price retain the structure you present in the examples (numbers are not expressed in words).

Using regular expressions you can extract number part and optional "cent"/"cents" or "dollar"/"dollars" using an approach from similar SO question. The two main differences is that you are looking for pairs of numerical value and "cent[s]" or "dollar[s]" and that they potentially occur more than once.

import re


def extract_number_currency(value):
    prices  = re.findall('(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?', value)

    result = 0.0
    for value, currency in prices:
        partial = float(value)
        if currency == 'cent':
            result += partial / 100
        else:
            result += partial

    return result


print(extract_number_currency('3 dollars and 49 cent'))
3.49

Now, what you need is to apply this function to all incorrect values in the column with prices in words. For simplicity I am applying it here to all values (but I am sure you will be able to deal with the subset):

data['Desired'] = data['Incorrect_Price'].apply(extract_number_currency)

Voila!


Breaking down of the regex '(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?'

There are two capture named groups (?P<name_of_the_capture_group> .... )

The first capture group (?P<value>[\d]*[.]?[\d]{1,2}) captures:

[\d] - digits

[\d]* - repeated 0 or more times

[.]? - followed by optional (?) dot

[\d]{1,2} - followed by a digit repeated from 1 to 2 times

\s* - denotes 0 or more whitespaces

Now the 2nd capture group which is much simpler: (?P<currency>cent|dollar)

cent|dollar - it boils down to alternative between cent and dollar strings being captured

s? is an optional plural of 'cent s' or 'dollar s'

sophros
  • 14,672
  • 11
  • 46
  • 75
  • wow that is super impressive. thank you so much! is there anyway you could provide a breakdown of the regex logic please? – TH14 Feb 03 '19 at 05:41