I have an Excel file which includes huge dataset (21157 rows x 32 columns). The first column is called 'user_id' such that hundreds of rows are labeled with the same user id (assume from row0 to row 200 indexed with 'user1', and from row201 to row500 for user2, and so on). My goal is to extract all the rows and all the columns data that only correspond to a specific user in order to get the data of each user separately based on the user id. The excel file is organized as follows.
user_id col2 col3 .....col32
row0 user1 20 10 ..... 76
row1 user1 22 20 ..... 65
row2 user1 45 30 ..... 45
row3 user1 43 56 ..... 67
.
.
.
row200 user2 20 10 ..... 76
row201 user2 22 20 ..... 65
row202 user2 45 30 ..... 45
row202 user2 43 56 ..... 67
.
.
.
I tried to read the whole excel file using pandas as follows:
df = pd.read_excel('file_name.xlsx') # data freams
user_id_col = df['user_id'] # extract user_id column
number_of_users = [len(list(group)) for key, group in
groupby(user_id_col)] # count number_of_users in the dataset (we got 41 users)
Orignal_data_matrix = df.values # convert datafram into matrix
print(Orignal_data_matrix)
print(Orignal_data_matrix.shape)
Can anyone tell me how I can extract all the columns data (i.e. from col2 to col 32) of multiple rows of each user separately based on his same user id (i.e., 1st col)?