
I have a large dataframe (50+ columns) with a "Project_Type" column that takes one of 5 values: "Project Type 1", "Project Type 2", "Project Type 3", "Project Type 4", or "Project Type 5". The other columns hold various performance measures (all integers), so I believe I need to normalize each "Project_Type" value into its own new column that is either 1 (if true) or 0 (if false). Then I can run .corr() over the project types and performance measures to see whether there are any correlations (such as certain project types costing more, making more of an impact, etc.).

I can create 5 new blank columns manually with:

df['Proj1Normalize'] = ""
df['Proj2Normalize'] = ""

etc...

and then get a value of 1 or 0 based on true or false, but is there a quicker way to add a large list of blank columns at once that have specific titles? This example is easy to do manually, but I have run into problems where I need to make 20+ new "normalized" columns at once and it is too time consuming to manually create them all.

It would also help if someone could explain an efficient way to normalize one column with multiple different values at once.

I tried df['Proj1Normalize', 'Proj2Normalize', 'Proj3Normalize', etc.] = "" but that didn't work. I also looked at "Add multiple empty columns to pandas DataFrame", but I don't want my columns to just have one-character names as in the first example there.

Example:

Right now I have:

  ProjectType  Dollars_Spent  Employees
0      Proj 1           1000         10
1      Proj 2           1800         12
2      Proj 1            800         14
3      Proj 3            980          5

and I want to have:

  ProjectType  Dollars_Spent  Employees  Proj 1  Proj 2  Proj 3
0      Proj 1           1000         10       1       0       0
1      Proj 2           1800         12       0       1       0
2      Proj 1            800         14       1       0       0
3      Proj 3            980          5       0       0       1

Any help would be great.

ldz
eluth
  • Can you provide example? and expected output – Poonam Aug 23 '17 at 04:07
  • A little hard to understand what you expected. Give us some well-illustrated examples. – rojeeer Aug 23 '17 at 04:37
  • @Poonam @rojeeer I put in an example. Again, I understand how to do this manually by creating a new column line by line, but that can be time consuming when you need to add a ton of columns, so I'm wondering what the best way is to create a bunch of blank columns all at once that have specific titles – eluth Aug 23 '17 at 13:09

2 Answers


If your goal is to encode a categorical column as 1/0 indicator columns, you can use pandas.get_dummies to do it. For example:

import pandas as pd

df = pd.DataFrame({'Type': [1, 2, 3, 2]})
new_df = pd.get_dummies(df, columns=['Type'])

Out[6]: 
    Type_1  Type_2  Type_3
0     1.0     0.0     0.0
1     0.0     1.0     0.0
2     0.0     0.0     1.0
3     0.0     1.0     0.0
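
For the literal question of adding many named placeholder columns at once, a minimal sketch is to reindex the columns. The column names and the fill value of 0 below are hypothetical choices for illustration, not something taken from the question's real data:

```python
import pandas as pd

df = pd.DataFrame({'ProjectType': ['Proj 1', 'Proj 2', 'Proj 1', 'Proj 3'],
                   'Dollars_Spent': [1000, 1800, 800, 980],
                   'Employees': [10, 12, 14, 5]})

# Hypothetical names for the placeholder columns (illustrative only).
new_cols = ['Proj{}Normalize'.format(i) for i in range(1, 6)]

# reindex keeps the existing columns and creates the missing ones,
# filling every new cell with fill_value.
df_blank = df.reindex(columns=list(df.columns) + new_cols, fill_value=0)
print(df_blank.columns.tolist())
```

With get_dummies this step is usually unnecessary, since the indicator columns are created and filled in a single call.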
nnvutisa
  • Thank you! I didn't know you could run get_dummies on a column with multiple string values. This made a new column for each possible string combination so I didn't even have to initialize any new columns – eluth Aug 23 '17 at 13:16
import pandas

df = pandas.DataFrame(
    data={'ProjectType': ['Proj 1', 'Proj 2', 'Proj 1', 'Proj 3'],
          'Dollars_Spent': [1000, 1800, 800, 980],
          'Employees': [10, 12, 14, 5]},
    columns=('ProjectType', 'Dollars_Spent', 'Employees'))

# One indicator column per distinct ProjectType, appended to the original columns
df_New = pandas.concat([df, pandas.get_dummies(df['ProjectType'])], axis=1)
print(df_New)

  ProjectType  Dollars_Spent  Employees  Proj 1  Proj 2  Proj 3
0      Proj 1           1000         10       1       0       0
1      Proj 2           1800         12       0       1       0
2      Proj 1            800         14       1       0       0
3      Proj 3            980          5       0       0       1

If the ProjectType column is no longer needed, you can drop it with: del df_New['ProjectType']

If you want to find additional information regarding get_dummies, please check https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
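
To get from the indicator columns to the correlations the question was after, a rough sketch could look like the following. It assumes the same sample data as above and passes dtype=int so the indicators are integers rather than booleans on newer pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'ProjectType': ['Proj 1', 'Proj 2', 'Proj 1', 'Proj 3'],
                   'Dollars_Spent': [1000, 1800, 800, 980],
                   'Employees': [10, 12, 14, 5]})

# dtype=int forces 1/0 indicator columns.
dummies = pd.get_dummies(df['ProjectType'], dtype=int)
df_new = pd.concat([df, dummies], axis=1)

# Drop the original string column so only numeric columns are correlated.
corr = df_new.drop(columns=['ProjectType']).corr()

# Rows: project types; columns: performance measures.
print(corr.loc[dummies.columns, ['Dollars_Spent', 'Employees']])
```

With only four rows the numbers themselves are not meaningful; the point is the shape of the workflow, which should carry over to the full dataframe.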

Poonam