1

I am currently working on a data science project. The idea is to clean the data from "glassdoor_jobs.csv" and present it in a more understandable form.

import pandas as pd

df = pd.read_csv('glassdoor_jobs.csv')

#salary parsing
#Removing "-1" Ratings
#Clean up "Founded"
#state field
#Parse out job description

df['hourly'] = df['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0)
df['employer_provided'] = df['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary' in x.lower() else 0)
df = df[df['Salary Estimate'] != '-1']
Salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0])
minus_Kd = Salary.apply(lambda x: x.replace('K', '').replace('$',''))

minus_hr = minus_Kd.apply(lambda x: x.lower().replace('per hour', '').replace('employer provided salary:', ''))

df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]))
df['max_salary'] = minus_hr.apply(lambda x: int(x.split('-')[1]))

I am getting an error on that last line. After digging a bit, I found that in minus_hr some of the 'Salary Estimate' values contain only one number instead of a range:

index Salary Estimate
0 150
1 58
2 130
3 125-150
4 110-140
5 200
6 67- 77
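
The failure can be reproduced in isolation. This is only a minimal sketch, assuming the cleaned strings look like the sample rows above; the `samples` list and the `results` name are made up for illustration.

```python
# Hypothetical sample mirroring the cleaned strings shown above
samples = ['150', '125-150', '67- 77']

results = []
for s in samples:
    parts = s.split('-')
    try:
        high = int(parts[1])
    except IndexError:
        # a single-number string splits into a one-element list,
        # so there is no element at index 1
        high = None
    results.append((parts, high))

for parts, high in results:
    print(parts, high)
```

Note that `int()` tolerates the leading space in `' 77'`, so only the missing second element is a problem.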

And so on. Now I'm trying to figure out how to work around the "list index out of range" error and make max_salary the same as min_salary for the cells with only one value.

I am also trying to get the average of the min and max salary; if a cell has only a single value, that value should be the average.

So in the end, something like index 0 would look like:

index min max average
0 150 150 150
sgy0003

3 Answers

0

Test the length of x.split('-') before accessing the elements.

def salary_bounds(x):
    salaries = x.split('-')
    if len(salaries) == 1:
        # only one salary number is given, so use the same value for min and max
        return int(salaries[0]), int(salaries[0])
    # two salary numbers are given
    return int(salaries[0]), int(salaries[1])

bounds = minus_hr.apply(salary_bounds)
df['min_salary'] = bounds.apply(lambda b: b[0])
df['max_salary'] = bounds.apply(lambda b: b[1])
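
To check the idea against the sample rows from the question, here is a self-contained sketch. The `minus_hr` series below is a stand-in built from those rows, and `split_range` is a hypothetical helper name, not something from the original post.

```python
import pandas as pd

# Hypothetical stand-in for minus_hr, using the sample values from the question
minus_hr = pd.Series(['150', '58', '130', '125-150', '110-140', '200', '67- 77'])

def split_range(text):
    """Return (min, max); a single number is used for both ends."""
    parts = [int(p) for p in text.split('-')]
    if len(parts) == 1:
        return parts[0], parts[0]
    return parts[0], parts[1]

# build both columns in one pass, keeping the original row index
bounds = minus_hr.apply(split_range)
result = pd.DataFrame(bounds.tolist(),
                      columns=['min_salary', 'max_salary'],
                      index=minus_hr.index)
result['average_salary'] = (result['min_salary'] + result['max_salary']) / 2
print(result)
```

Passing `index=minus_hr.index` matters once rows have been filtered out of the frame, since the index is then no longer 0..n-1.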
John Gordon
0

You'll have to add in a conditional statement somewhere.

df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]) if '-' in x else int(x))

The above might do it, or you can define a function.

def max_salary(cell_value):
    if '-' in cell_value:
        # a range: take the number after the hyphen
        return int(cell_value.split('-')[1])
    # a single value: it is both the min and the max
    return int(cell_value)

df['max_salary'] = minus_hr.apply(max_salary)


def avg_salary(cell_value):
    # convert each part to int; a single value yields a one-element list
    salaries = [int(s) for s in cell_value.split('-')]
    return sum(salaries) / len(salaries)

df['avg_salary'] = minus_hr.apply(avg_salary)

For min_salary, write the same function but take index 0 of the split instead of index 1.

ciaran haines
  • So following your first example, I got the min and max. What should I do about average? Apparently division by 2 is impossible at the current state – sgy0003 Aug 29 '21 at 01:01
  • updated. If this works for you can you mark it as an answer? I've never been an answerer before for coding :) – ciaran haines Aug 29 '21 at 08:45
  • So turns out average salary part was easier than I realized; All I had to do was the following: `df['average_salary'] = (df.min_salary.astype(int) + df.max_salary.astype(int))/2` But thanks for answering. min and max was a real headache and you helped me out! – sgy0003 Aug 30 '21 at 16:00
  • Winner! I hate it when you get stuck with an idea that should be simple but you can't quite figure out. On the flip side, it's lovely if you can figure it out. Cheers for the mark up :D – ciaran haines Aug 31 '21 at 09:18
0

If you want to avoid .apply()...

Try:

import numpy as np

# extract the two numbers (if there are two numbers) from the 'Salary Estimate' column
sals =  df['Salary Estimate'].str.extractall(r'(?P<min_salary>\d+)[^0-9]*(?P<max_salary>\d*)?')

# keep the first match per row; this drops the 'match' level and restores the original row index
sals = sals.xs(0, level='match')

# join the extracted min/max salary columns to the original dataframe (missing values are already nan)
df = df.join(sals[['min_salary', 'max_salary']])

# fill any nan values in the 'max_salary' column with values from the 'min_salary' column
df['max_salary'] = df['max_salary'].fillna(df['min_salary'])

# set the type of the columns to int
df['min_salary'] = df['min_salary'].astype(int)
df['max_salary'] = df['max_salary'].astype(int)

# calculate the average
df['average_salary'] = df.loc[:,['min_salary', 'max_salary']].mean(axis=1).astype(int)

# see what you've got
print(df)
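
A variation on the same idea, as a sketch only: it swaps in `str.extract` (first match only) for `extractall`, so the result keeps the original row index and joins cleanly even after rows have been filtered out. The sample strings, the gappy index and the adjusted pattern are assumptions for illustration, not taken from the real file.

```python
import pandas as pd

# Hypothetical frame with a non-contiguous index, as left behind by filtering rows
df = pd.DataFrame(
    {'Salary Estimate': ['$53K-$91K (Glassdoor est.)', '$120K', '$21-$34 Per Hour']},
    index=[0, 2, 5],
)

# the second number sits in an optional group, so it comes back as NaN when absent
pattern = r'(?P<min_salary>\d+)(?:\D+(?P<max_salary>\d+))?'
sals = df['Salary Estimate'].str.extract(pattern)

# a single value is both the min and the max
sals['max_salary'] = sals['max_salary'].fillna(sals['min_salary'])
sals = sals.astype(int)
sals['average_salary'] = sals[['min_salary', 'max_salary']].mean(axis=1)

# both frames share the same index, so the join lines up row by row
df = df.join(sals)
print(df)
```

Here the average is left as a float, so a range such as 21-34 keeps its .5.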

Or without using regex:

import numpy as np

# extract the two numbers (if there are two numbers) from the 'Salary Estimate' column
df['sals'] =  df['Salary Estimate'].str.split('-')

# expand the list in sals to two columns filling with nan
df[['min_salary', 'max_salary']] = pd.DataFrame(df.sals.tolist(), index=df.index).fillna(np.nan)

# delete the sals column
del df['sals']

# fill any nan values in the 'max_salary' column with values from the 'min_salary' column
df['max_salary'] = df['max_salary'].fillna(df['min_salary'])

# set the type of the columns to int
df['min_salary'] = df['min_salary'].astype(int)
df['max_salary'] = df['max_salary'].astype(int)

# calculate the average
df['average_salary'] = df.loc[:,['min_salary', 'max_salary']].mean(axis=1).astype(int)

# see what you've got
print(df)

Output:

  Salary Estimate  min_salary  max_salary  average_salary
0             150         150         150             150
1              58          58          58              58
2             130         130         130             130
3         125-150         125         150             137
4         110-140         110         140             125
5             200         200         200             200
6          67- 77          67          77              72
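
One detail in the output above: the average for 125-150 shows as 137 rather than 137.5, because `.astype(int)` drops the fractional part. A small sketch of the difference, using a made-up two-row frame named `bounds`:

```python
import pandas as pd

# Hypothetical frame holding two of the ranges from the output above
bounds = pd.DataFrame({'min_salary': [125, 67], 'max_salary': [150, 77]})

mean = bounds.mean(axis=1)    # float averages
truncated = mean.astype(int)  # fractional part dropped

print(mean.tolist())
print(truncated.tolist())
```

If the half matters, leaving the column as float (as in the asker's own `/2` solution) avoids the truncation.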
MDR