Extract numbers from strings from a column in pandas dataframe

Question

I have a dataframe called data, I am trying to clean one of the columns in the dataframe so I can convert the price into numerical values only. This is how I'm filtering for the column to find those incorrect values. data[data['incorrect_price'].astype(str).str.contains('[A-Za-z]')]

    Incorrect_Price    Occurences   errors
23  99 cents                732       1
50  3 dollars and 49 cents  211       1
72  the price is 625        128       3
86  new price is 4.39       19        2
138 4 bucks                 3         1
199 new price 429           13        1
225 price is 9.99           5         1
240 new price is 499        8         2

I have tried data['incorrect_Price'][20:51].str.findall(r"(\d+) dollars") and data['incorrect_Price'][20:51].str.findall(r"(\d+) cents") to find rows that have "cents" and "dollars" in them so I can extract the dollar and cents amount but haven't been able to incorporate this when iterating over all rows in the dataframe.

  I would like the results to like look this:  

    Incorrect_Price        Desired    Occurences    errors
23  99 cents                .99           732         1
50  3 dollars and 49 cents  3.49          211         1
72  the price is 625        625           128         3
86  new price is 4.39       4.39           19         2
138 4 bucks                 4.00           3          1
199 new price 429           429            13         1
225 price is 9.99           9.99           5          1
240 new price is 499        499            8          2

For starters, there is no column 'Shopper_Entered_Price' in your example. — DYZ, Feb 02 '19 at 06:57

sophros · Accepted Answer · 2019-02-03T06:21:11.917

The task can be relatively easily solved as long as the strings Incorrect_Price retain the structure you present in the examples (numbers are not expressed in words).

Using regular expressions you can extract number part and optional "cent"/"cents" or "dollar"/"dollars" using an approach from similar SO question. The two main differences is that you are looking for pairs of numerical value and "cent[s]" or "dollar[s]" and that they potentially occur more than once.

import re


def extract_number_currency(value):
    prices  = re.findall('(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?', value)

    result = 0.0
    for value, currency in prices:
        partial = float(value)
        if currency == 'cent':
            result += partial / 100
        else:
            result += partial

    return result


print(extract_number_currency('3 dollars and 49 cent'))

3.49

Now, what you need is to apply this function to all incorrect values in the column with prices in words. For simplicity I am applying it here to all values (but I am sure you will be able to deal with the subset):

data['Desired'] = data['Incorrect_Price'].apply(extract_number_currency)

Voila!

Breaking down of the regex '(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?'

There are two capture named groups (?P<name_of_the_capture_group> .... )

The first capture group (?P<value>[\d]*[.]?[\d]{1,2}) captures:

[\d] - digits

[\d]* - repeated 0 or more times

[.]? - followed by optional (?) dot

[\d]{1,2} - followed by a digit repeated from 1 to 2 times

\s* - denotes 0 or more whitespaces

Now the 2nd capture group which is much simpler: (?P<currency>cent|dollar)

cent|dollar - it boils down to alternative between cent and dollar strings being captured

s? is an optional plural of 'cent s' or 'dollar s'

wow that is super impressive. thank you so much! is there anyway you could provide a breakdown of the regex logic please? — TH14, Feb 03 '19 at 05:41

Extract numbers from strings from a column in pandas dataframe

1 Answers1