I have a dataframe that has 5 columns scraped from a site. What I want to do is create an additional column based on the contents of the first two columns, for example, say the data looks like this:
Duration Issues in 1 year
Pay by Annual Recurring Payment 51
Pay every 3 months by Recurring Payment 51
Pay every 6 months by Recurring Payment 51
First 3 issues for £3, then £15 recurring every 6 months thereafter 14
One off payment - Pay for 1 year 14
First 6 issues for £10, then £15 recurring every 6 months thereafter 9
One-Off Payment – Pay for 9 issues 12
One-Off Payment – Pay for 20 issues 51
First year for £29.99, then £20 recurring every 6 months thereafter 13
I want to have an additional column that contains the number of months in the deal based on the 'Duration' string and (when nessecery) calculates the number of months by using the 'Issues in 1 year' column as well.
Ive managed to get want I want for most of them by copying Duration to a new column and using 'str.contains':
df1['Months'] = df1['Duration']
df1.loc[df1['Months'].str.contains('1 year|annual', case=False), 'Months'] = 12
df1.loc[df1['Months'].str.contains('6 months by', case=False), 'Months'] = 6
df1.loc[df1['Months'].str.contains('3 months by', case=False), 'Months'] = 3
The above does seem a little clunky and I feel like there could be a slicker solution, but it works.
When it comes to the Durations that have a fixed cost for the first 3 or 6 issues then im only interested in the number of months for the intial payment, so have used:
df1.loc[df1['Months'].str.contains('first 3', case=False), 'Months'] = round((12 / df1.Issues) * 3,0)
The above does appear to be working but could be more efficient.
Im now super stuck for the 'Pay for x issues' type. I need to be able to identify the strings with that pattern and then also use the number within it to calculate the answer, I have tried to following applying the same methodology as before but using extract but I get and unexpected keyword arguement 'case':
df1.loc[df1['Months'].str.contains('Pay for (.+?) issues', case=False), 'Months'] = round((12 / df1.Issues) * df1.loc[df1['Months'].str.extract('Pay for (.+?) issues', case=False), 'Months'],0)
Im not sure if my regex logic is correct as im still getting to grips with it but I copied it from this post.
To (try and) simplfy; I am trying to achieve:
If 'One-Off Payment – Pay for 20 issues' contains '...Pay for x issues...' = 12 / Issues(51) * 20
Which would give an end result of:
Duration Issues in 1 year Months
One-Off Payment – Pay for 20 issues 51 5
Also if there is a simple way of doing the above I assume the logic could be applied to the 'Pay every x months...' strings.
Any help would be super appreciated, I am new and have tried to find an answer for days but without results.