
Let's say we have a list of employees and some other data:

  Employee   Location   Title
0        1  Location1  Title1
1        2  Location2  Title1
2        3  Location3  Title2
3        4  Location1  Title3
4        5  Location1  Title2

I am transposing it to features and labels with (1, 0) values, and it works, but it takes ages on a dataset of 6k records. The logic: take each value from Location and make it a column; if the employee's location matches the column, put 1, else put 0.

My question: is it possible to optimize the performance somehow? My lack of terminology makes it difficult to find a better solution, but I believe there should be one.

The final output looks like this:

 Employee  Location1  Location2  Location3  Title1  Title2  Title3
0        1          1          0          0       1       0       0
1        2          0          1          0       1       0       0
2        3          0          0          1       0       1       0
3        4          1          0          0       0       0       1
4        5          1          0          0       0       1       0

The working code that takes ages to complete:

import pandas as pd
df = pd.DataFrame.from_dict({'Employee': ['1','2','3','4','5'], 
      'Location': ['Location1', 'Location2','Location3','Location1','Location1'],
      'Title': ['Title1','Title1','Title2','Title3','Title2']
     })
df_tr = df['Employee'] #temporary employee ids

# transposing the data, which takes ages:

df_newcols = {}
for column in list(df)[1:]:
    newcols = df[column].unique()
    for key in newcols:
        temp_ar = []
        for value in df[column]:
            if key == value:
                temp_ar.append(1)
            else:
                temp_ar.append(0)
        df_newcols[key] = temp_ar
print (df_newcols)

# adding transposed to the temp df

df_temp = pd.DataFrame.from_dict(df_newcols)

# merging with df with employee ids

new_df = pd.concat([df_tr,df_temp],axis=1)
  • Use [`pd.get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html). – jpp Jan 17 '20 at 00:49

3 Answers


This should do the trick:

df["_dummy"]=1
df2=pd.concat([
    df.pivot_table(index="Employee", columns="Location", values="_dummy", aggfunc=max), 
    df.pivot_table(index="Employee", columns="Title", values="_dummy", aggfunc=max)
], axis=1).fillna(0).astype(int).reset_index(drop=False)

Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
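
On the sample df from the question, df2 should match the desired output exactly; aggfunc=max simply keeps the single 1 for each (Employee, Location) or (Employee, Title) pair and collapses duplicates, if any. A quick sketch of the expected result (not re-run on the 6k-row data):

print(df2)
#   Employee  Location1  Location2  Location3  Title1  Title2  Title3
# 0        1          1          0          0       1       0       0
# 1        2          0          1          0       1       0       0
# 2        3          0          0          1       0       1       0
# 3        4          1          0          0       0       0       1
# 4        5          1          0          0       0       1       0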

Grzegorz Skibinski

Another solution using pd.get_dummies():

print( pd.concat([df['Employee'],
                  pd.get_dummies(df['Location']),
                  pd.get_dummies(df['Title'])], axis=1) )

Prints:

  Employee  Location1  Location2  Location3  Title1  Title2  Title3
0        1          1          0          0       1       0       0
1        2          0          1          0       1       0       0
2        3          0          0          1       0       1       0
3        4          1          0          0       0       0       1
4        5          1          0          0       0       1       0
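
If you prefer a single call, pd.get_dummies also accepts the whole DataFrame together with a columns= list; passing an empty prefix and prefix_sep keeps the plain category names as column headers. A small variant sketch, not part of the original answer:

# one-hot encode both columns in one call; Employee is passed through unchanged
print( pd.get_dummies(df, columns=['Location', 'Title'], prefix='', prefix_sep='') )
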
Andrej Kesely

You should try to use the apply method and pandas' built-in methods more in general. Using for loops in pandas is very bad: it kills your performance.

A possible solution is the following:

import pandas as pd


# read the file
emp=pd.read_csv("employee_huge.txt", sep=" ")


# generate unique lists containing LocationX and TitleX
lnewcols_location=set(emp["Location"].to_list())
lnewcols_title=set(emp["Title"].to_list())


# a function to compare a cell (like "Location1") to a string that is the name of the column
# like "Location2".  If they match return 1, otherwise 0
def same_as_col(acell, col):
    if(acell==col):
        return(1)
    else:
        return(0)


# generate all the LocationN columns with 1 or 0 if there is a match
for i in lnewcols_location:
  emp[i]=emp["Location"].apply(same_as_col, col=i)

# generate all the TitleN columns with 1 or 0 if there is a match
for i in lnewcols_title:
  emp[i]=emp["Title"].apply(same_as_col, col=i)

# removing Location and Title columns
emp=emp.drop(["Location", "Title"], axis=1)

A final note: I generated a file called employee_huge.txt. Its contents are formatted like the following:

Employee Location Title
0 Location4 Title1
1 Location1 Title3
2 Location1 Title2
3 Location1 Title4
4 Location4 Title1
Fabrizio
  • All correct, except you should avoid ```.apply(...)```: https://stackoverflow.com/a/54432584/11610186 – Grzegorz Skibinski Jan 17 '20 at 13:59
  • Thanks for the comment, I am also new to pandas so my solution is not optimal like yours, although it works in a second. I will replace apply with a list comprehension when possible in the future. – Fabrizio Jan 17 '20 at 14:30
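
For reference, a minimal sketch of the vectorized comparison the comment above points to, replacing .apply with a direct element-wise comparison (same column names as in the answer; a sketch only, not tested against the original file):

# build the dummy columns with a vectorized comparison instead of .apply
for i in lnewcols_location:
    emp[i] = (emp["Location"] == i).astype(int)

for i in lnewcols_title:
    emp[i] = (emp["Title"] == i).astype(int)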