
To look for correlations between products and categories and then visualize them (heatmaps), I need to reshape a table using Python, with or without Pandas or other libraries, from this:

Book Name   Category 1  Category 2  Category 3
Django 101  Python      Web-Dev     Beginner
ROR Guide   Rails       Web-Dev     Intermediate
Laravel     PHP         Web-Dev     Intermediate

into that:

Book Name   Python  Web-Dev  Beginner  Rails  PHP    Intermediate
Django 101  True    True     True      False  False  False
ROR Guide   False   True     False     True   False  True
Laravel     False   True     False     False  True   True

Is there any way to do that? The data is stored in a .csv file and read with pandas.read_csv().

Sergei

1 Answer


This can be done using the get_dummies function in Pandas.

import pandas as pd

df = pd.DataFrame({'Book Name': ['Django 101', 'ROR Guide', 'Laravel'], 'Category 1': ['Python', 'Rails', 'PHP'], 'Category 2': ['Web-Dev']*3, 'Category 3': ['Beginner', 'Intermediate', 'Intermediate']})

# One-hot encode each category column separately, then join the pieces side by side
dummies = pd.concat([pd.get_dummies(df[c]) for c in df.columns[1:]], axis=1)
df_new = pd.concat([df['Book Name'], dummies], axis=1)

>>> df_new
    Book Name  PHP  Python  Rails  Web-Dev  Beginner  Intermediate
0  Django 101    0       1      0        1         1             0
1   ROR Guide    0       0      1        1         0             1
2     Laravel    1       0      0        1         0             1

Or you can set the index of the DataFrame to the book name:

df.set_index('Book Name', inplace=True)
df_new = pd.concat([pd.get_dummies(df[c]) for c in df], axis=1)
>>> df_new
            PHP  Python  Rails  Web-Dev  Beginner  Intermediate
Book Name                                                      
Django 101    0       1      0        1         1             0
ROR Guide     0       0      1        1         0             1
Laravel       1       0      0        1         0             1
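
The question asks for True/False values rather than 0/1. A minimal sketch of one way to get them, assuming the same sample data as above: cast the indicator columns with astype(bool). The name flags is only illustrative. Recent pandas versions (2.0 and later) already return boolean columns from get_dummies, in which case the cast changes nothing.

```python
import pandas as pd

# Same sample data as in the answer, indexed by book name
df = pd.DataFrame({
    'Book Name': ['Django 101', 'ROR Guide', 'Laravel'],
    'Category 1': ['Python', 'Rails', 'PHP'],
    'Category 2': ['Web-Dev'] * 3,
    'Category 3': ['Beginner', 'Intermediate', 'Intermediate'],
}).set_index('Book Name')

# One-hot encode each category column and join the results side by side
dummies = pd.concat([pd.get_dummies(df[c]) for c in df], axis=1)

# Cast the indicators to True/False (hypothetical name: flags)
flags = dummies.astype(bool)
print(flags)
```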
Alexander
  • Unfortunately, my data looks like this:
    Book Name, Category 1, Category 2, Category 3
    Django 101, Python, Web-Dev, Beginner
    ROR Guide, Rails, Intermediate, Web-Dev
    Laravel, Beginner, Web-Dev, PHP
    so it produces duplicate columns. – Sergei Jun 19 '15 at 14:19
  • This does not work quite right, since the same category can appear in different columns, which produces more duplicates: `df = pd.DataFrame({'Book Name': ['Django 101', 'ROR Guide', 'Laravel'], 'Category 1': ['Python', 'Intermediate', 'PHP'], 'Category 2': ['Web-Dev', 'Web-Dev', 'Intermediate'], 'Category 3': ['Beginner', 'Rails', 'Web-Dev']})` Is there any way to avoid duplicate columns? – Sergei Jun 19 '15 at 15:51
  • @sergei It is up to you to define the categorization. To ensure uniqueness across categories, you can prepend an identifier to each name in the column, e.g. cat1_beginner will be distinct from cat2_beginner. – Alexander Jun 19 '15 at 16:46
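
The two ways of handling mixed categories discussed in the comments can be sketched as follows, using the DataFrame from the second comment. Option A follows the prefix suggestion by passing prefix to get_dummies; the labels cat1, cat2 and cat3 are illustrative. Option B is not from the thread: it is an assumed alternative that melts the category columns into one long column first, so each category ends up in exactly one output column. The names prefixed, long and merged are hypothetical.

```python
import pandas as pd

# Sample data from the comment above: the same category appears in different columns
df = pd.DataFrame({
    'Book Name': ['Django 101', 'ROR Guide', 'Laravel'],
    'Category 1': ['Python', 'Intermediate', 'PHP'],
    'Category 2': ['Web-Dev', 'Web-Dev', 'Intermediate'],
    'Category 3': ['Beginner', 'Rails', 'Web-Dev'],
}).set_index('Book Name')

# Option A: keep one indicator per (source column, value) pair,
# made unique by a per-column prefix such as cat1_, cat2_, cat3_
prefixed = pd.get_dummies(df, prefix=['cat1', 'cat2', 'cat3'])

# Option B (assumption, not from the thread): ignore which column a value came from.
# Melt to one row per (book, category), one-hot encode, then combine rows per book.
long = df.reset_index().melt(id_vars='Book Name', value_name='Category')
merged = (
    pd.get_dummies(long.set_index('Book Name')['Category'])
    .groupby(level=0, sort=False)  # sort=False keeps the original book order
    .max()                         # a category is set if any row for the book has it
    .astype(bool)
)

print(prefixed.columns.tolist())
print(merged)
```

Option A preserves the position of each category, which matters if Category 1 means something different from Category 3. Option B suits the case where a category means the same thing whichever column it appears in.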