I have a DataFrame, and I want to record the time each id spends using each app_id. Since both the number of ids and the number of app_ids are large, I want to store the result in a scipy sparse.csr_matrix.

Input:

import random
import string

import numpy as np
import pandas as pd


def randomword(length):
    # Build `length` random app ids: one lowercase letter plus a number in [0, 1000)
    letters = string.ascii_lowercase
    nums = np.arange(1000)
    appList = []
    for i in range(length):
        appList.append(''.join([random.choice(letters),
                                str(random.choice(nums))]))
    return appList


appList = randomword(300000)
timeList = [random.randrange(0, 10000, 1) for _ in range(300000)]
idList = [random.randrange(0, 70000, 1) for _ in range(300000)]

df = pd.DataFrame({'id': idList, 'app_id': appList, 'time': timeList})
print(df.head())
# Number of distinct ids and distinct app_ids
print('idList length:', len(set(idList)))
print('appList length:', len(set(appList)))

Output:

      id app_id  time
0  64365   c789  7366
1  54623   a391  3080
2  58511   m570  9091
3  37657   m108  4707
4   1343   m771   973

idList length: 69062
appList length: 26000

Expected:

For convenience, I use df.head() as an example. The following DataFrame is what I want to get, and I expect it to be stored as a csr_matrix.

      id  c789  a391  m570  m108  m771
0  64365  7366     0     0     0     0
1  54623     0  3080     0     0     0
2  58511     0     0  9091     0     0
3  37657     0     0     0  4707     0
4   1343     0     0     0     0   973

As the output above shows, there are 69062 distinct id values and 26000 distinct app_id values, so I expect to get a csr_matrix with shape (69062, 26001), that is, the id column plus one column per app_id.
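
A minimal sketch of one way such a matrix might be built, shown on the five rows of df.head(). It rests on two assumptions that the question does not settle: repeated (id, app_id) pairs should have their time values summed, and the id values are kept in a separate row-label array instead of as a matrix column, so the resulting shape is (n_ids, n_apps) without the extra id column. The names row_labels, col_labels, and mat are illustrative only.

```python
import pandas as pd
from scipy import sparse

# Small stand-in for df.head() from the question
df = pd.DataFrame({
    'id': [64365, 54623, 58511, 37657, 1343],
    'app_id': ['c789', 'a391', 'm570', 'm108', 'm771'],
    'time': [7366, 3080, 9091, 4707, 973],
})

# Map each id / app_id to an integer position (order of first appearance);
# the *_labels arrays let us translate positions back to the original values
row_codes, row_labels = pd.factorize(df['id'])
col_codes, col_labels = pd.factorize(df['app_id'])

# Build in COO form, then convert; duplicate (row, col) entries are summed
# during the conversion to CSR (assumed to be the desired aggregation)
mat = sparse.coo_matrix(
    (df['time'].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
).tocsr()

print(mat.shape)
print(mat.toarray())
```

Row i of mat corresponds to row_labels[i] and column j to col_labels[j], so the id and app_id values can be recovered without storing them in the matrix itself.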

rosefun