0

I have a dataframe object as follows:

'Name' 'Book' 'Rating'

'John' 'Moby Dick' 5

'John' 'The Alchemist' 3

'John' 'The Da Vinci Code' 4

'Peter' 'Moby Dick' 4

'Peter' 'Lolita' 5

'Mike' 'The Alchemist' 4

'Mike' 'The Da Vinci Code' 3

I want to create a 2-d array from this data of size [len('Name')][len('Book')] and the values of the array should be the ratings. If a person has not rated a book, value should be zero.

5 3 0 4

4 0 5 0

0 4 0 3

Here, my first row corresponds to John, next Peter and last row Mike and 1st column corresponds to Moby Dick, 2nd to The Alchemist, 3rd to Lolita and 4th to The Da Vinci Code.

I have tried the following where I scan the entire object (twice I believe, to get a match for each column).

import pandas
import numpy

df = read_from_file

uName = df['Name'].unique().tolist()
uBook = df['Book'].unique().tolist()
m = numpy.zeros((len(uName), len(uBook)))

for i in range(0, len(uName)):
    for j in range(0, len(uBook)):
        x = df.loc[(df['Name'] == uName[i]) & (df['Book'] == uBook[j])]
        if x.empty:
            m[i][j] = 0
        else:
            m[i][j] = x.Rating

It gives me the right results but it's extremely inefficient. My dataframe is huge - ~50,000 rows, and it takes forever to run this piece of code. Can you help me with a more efficient way to achieve this? Thanks.

Nihal
  • 5,262
  • 7
  • 23
  • 41
Gengis Khan
  • 163
  • 1
  • 2
  • 10

2 Answers2

0
>>> import pandas as pd
>>> data = pd.DataFrame({'x': ['A', 'A', 'B','B', 'C'], 'y':['q','p','q', 'p', 'q'], 'r': [1, 2, 3, 4, 5]})
>>> data.pivot(values='r', columns='Name', index='Book')
x    A    B    C
y               
p  2.0  4.0  NaN
q  1.0  3.0  5.0

or

>>> data.pivot(values='r', columns='x', index='y').fillna(0).values
array([[2., 4., 0.],
       [1., 3., 5.]])

In case of your data frame the code should be something like this:

data.pivot(values='Rating', columns='x', index='y').fillna(0).values
bubble
  • 1,634
  • 12
  • 17
0

data(temp.csv):

Name,Book,Rating
John,Moby Dick, 5
John,The Alchemist, 3
John,The Da Vinci Code, 4
Peter,Moby Dick, 4
Peter,Lolita, 5
Mike,The Alchemist, 4
Mike,The Da Vinci Code, 3

code:

df = pd.read_csv('temp.csv')
print(df)
print(df.pivot(values='Rating', columns='Book', index='Name').fillna(0))

output

Book   Lolita  Moby Dick  The Alchemist  The Da Vinci Code
Name                                                      
John      0.0        5.0            3.0                4.0
Mike      0.0        0.0            4.0                3.0
Peter     5.0        4.0            0.0                0.0
Nihal
  • 5,262
  • 7
  • 23
  • 41