I am currently working on a data science project. The Idea is to clean the data from "glassdoor_jobs.csv", and present it in a much more understandable manner.
import pandas as pd
df = pd.read_csv('glassdoor_jobs.csv')
#salary parsing
#Removing "-1" Ratings
#Clean up "Founded"
#state field
#Parse out job description
df['hourly'] = df['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0)
df['employer_provided'] = df['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary' in x.lower() else 0)
df = df[df['Salary Estimate'] != '-1']
Salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0])
minus_Kd = Salary.apply(lambda x: x.replace('K', '').replace('$',''))
minus_hr = minus_Kd.apply(lambda x: x.lower().replace('per hour', '').replace('employer provided salary:', ''))
df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]))
df['max_salary'] = minus_hr.apply(lambda x: int(x.split('-')[1]))
I am getting the error at that last line. After digging a bit, I found out in minus_hr, some of the 'Salary Estimate' only has one number instead of range:
index | Salary Estimate |
---|---|
0 | 150 |
1 | 58 |
2 | 130 |
3 | 125-150 |
4 | 110-140 |
5 | 200 |
6 | 67- 77 |
And so on. Now I'm trying to figure out how to work around the "list index out of range", and make max_salary the same as the min_salary for the cells with only one value.
I am also trying to get average between the min and max salary, and if the cell only has a single value, make that value the average
So in the end, something like index 0 would look like:
index | min | max | average |
---|---|---|---|
0 | 150 | 150 | 150 |