Finding a substring then using digits within that string to calculate a new column in pandas

Question

I have a dataframe that has 5 columns scraped from a site. What I want to do is create an additional column based on the contents of the first two columns, for example, say the data looks like this:

Duration                                                               Issues in 1 year
Pay by Annual Recurring Payment                                         51
Pay every 3 months by Recurring Payment                                 51
Pay every 6 months by Recurring Payment                                 51
First 3 issues for £3, then £15 recurring every 6 months thereafter     14
One off payment - Pay for 1 year                                        14
First 6 issues for £10, then £15 recurring every 6 months thereafter     9
One-Off Payment – Pay for 9 issues                                      12
One-Off Payment – Pay for 20 issues                                     51
First year for £29.99, then £20 recurring every 6 months thereafter     13

I want to have an additional column that contains the number of months in the deal based on the 'Duration' string and (when nessecery) calculates the number of months by using the 'Issues in 1 year' column as well.

Ive managed to get want I want for most of them by copying Duration to a new column and using 'str.contains':

df1['Months'] = df1['Duration']
df1.loc[df1['Months'].str.contains('1 year|annual', case=False), 'Months'] = 12
df1.loc[df1['Months'].str.contains('6 months by', case=False), 'Months'] = 6
df1.loc[df1['Months'].str.contains('3 months by', case=False), 'Months'] = 3

The above does seem a little clunky and I feel like there could be a slicker solution, but it works.

When it comes to the Durations that have a fixed cost for the first 3 or 6 issues then im only interested in the number of months for the intial payment, so have used:

df1.loc[df1['Months'].str.contains('first 3', case=False), 'Months'] = round((12 / df1.Issues) * 3,0)

The above does appear to be working but could be more efficient.

Im now super stuck for the 'Pay for x issues' type. I need to be able to identify the strings with that pattern and then also use the number within it to calculate the answer, I have tried to following applying the same methodology as before but using extract but I get and unexpected keyword arguement 'case':

df1.loc[df1['Months'].str.contains('Pay for (.+?) issues', case=False), 'Months'] = round((12 / df1.Issues) * df1.loc[df1['Months'].str.extract('Pay for (.+?) issues', case=False), 'Months'],0)

Im not sure if my regex logic is correct as im still getting to grips with it but I copied it from this post.

To (try and) simplfy; I am trying to achieve:

If 'One-Off Payment – Pay for 20 issues' contains '...Pay for x issues...' = 12 / Issues(51) * 20

Which would give an end result of:

Duration                                  Issues in 1 year      Months
One-Off Payment – Pay for 20 issues       51                    5

Also if there is a simple way of doing the above I assume the logic could be applied to the 'Pay every x months...' strings.

Any help would be super appreciated, I am new and have tried to find an answer for days but without results.

YOLO · Answer 1 · 2018-02-23T12:33:47.983

Assuming 'Pay for x issues' statements doesn't contain any other number, you can try this.

import re
import pandas as pd

## sample data frame
df = pd.DataFrame({'Duration':['Pay by Annual Recurring Payment',                                         
'Pay every 3 months by Recurring Payment',                               
'Pay every 6 months by Recurring Payment',                               
'First 3 issues for £3, then £15 recurring every 6 months thereafter',
'One off payment - Pay for 1 year',
'First 6 issues for £10, then £15 recurring every 6 months thereafter',
'One-Off Payment – Pay for 9 issues',                                 
'One-Off Payment – Pay for 20 issues',  
'First year for £29.99, then £20 recurring every 6 months thereafter'], 'Issues_in_1_year' : [51, 51, 51,14,14,9,12,51,13]  })

## extract month and pay value in separate columns
df['Months'] = df['Duration'].str.extract('(\d+) months by').fillna(-1).astype(int)
df.loc[df['Duration'].str.contains('(\d+) year| (\d+) annual | Annual'),'Months'] = 12
df['Pay_Value'] = df['Duration'].str.extract('Pay for (\d+)').fillna(-1).astype(int)

## calculate solution
def get_sol(row):
    if row.Months == -1 and row.Pay_Value == -1:
         return 0
    elif row.Months != -1 and row.Pay_Value == -1:
        return round((12/ row.Issues_in_1_year) * row.Months)
    elif row.Months == -1 and row.Pay_Value != -1:
        return round((12/ row.Issues_in_1_year) * row.Pay_Value) 

df['solution'] = df.apply(get_sol, axis=1)
print(df)

And, the output looks like this where solution is the column we have calculated (few rows):

    Duration                                 Issues_in_1_year   Months  Pay_Value   solution
0   Pay by Annual Recurring Payment                 51           12        -1       3
1   Pay every 3 months by Recurring Payment         51            3        -1       1
2   Pay every 6 months by Recurring Payment         51            6        -1       1
3   One-Off Payment – Pay for 20 issues             51           -1        20       5

Hi @saraswatmks, thanks for taking the time to respond. I'm not sure you've understood what im asking for or if I have explained it correctly as your output isnt the result I'm trying to get to and I'm not sure what the Pay_Value relates to? The number of months should be based on the duration text, so for row 0 the ouput should be 12 months not -1. And row 3 should be 5 not -1. — G-Money, Feb 23 '18 at 12:19
I've updated the code. I see you want the result as integers instead of floats so now the values are rounded off. Months refer to the extracted value of month from string. Pay_Value refers to the extracted value from 'Pay for x issues' string. Just to keep it simple to understand, I extracted both values in separate columns. 'Solution' refer to the final calculated value as you mentioned above. — YOLO, Feb 23 '18 at 12:35
Hi Manish, thanks for your help, I have managed to put this into my code and got it working with the function, now I just need to run an if statement to merge all the relevant data into the months column and I cant work out how to do that successfully by iterating through each line on the dataframe. I have tried `if df1['Months'] == -1: df1['Months']=df1[solution] else: pass` but this is giving me an error. — G-Money, Feb 27 '18 at 22:12

Finding a substring then using digits within that string to calculate a new column in pandas

1 Answers1