0

I have an Excel file which includes huge dataset (21157 rows x 32 columns). The first column is called 'user_id' such that hundreds of rows are labeled with the same user id (assume from row0 to row 200 indexed with 'user1', and from row201 to row500 for user2, and so on). My goal is to extract all the rows and all the columns data that only correspond to a specific user in order to get the data of each user separately based on the user id. The excel file is organized as follows.

               user_id   col2 col3 .....col32
      row0     user1      20    10  ..... 76
      row1     user1      22    20  ..... 65
      row2     user1      45    30  ..... 45
      row3     user1      43    56  ..... 67
       .
       .
       .
     row200    user2      20    10  ..... 76
     row201    user2      22    20  ..... 65
     row202    user2      45    30  ..... 45
     row202    user2      43    56  ..... 67
       .
       .
       .

I tried to read the whole excel file using pandas as follows:

  df = pd.read_excel('file_name.xlsx')     # data freams
  user_id_col = df['user_id'] # extract user_id column
  number_of_users = [len(list(group)) for key, group in 
  groupby(user_id_col)] # count number_of_users in the dataset (we got 41 users)

 Orignal_data_matrix = df.values # convert datafram into matrix
 print(Orignal_data_matrix)
 print(Orignal_data_matrix.shape)

Can anyone tell me how I can extract all the columns data (i.e. from col2 to col 32) of multiple rows of each user separately based on his same user id (i.e., 1st col)?

Mohsen Ali
  • 655
  • 1
  • 9
  • 30
  • Have you looked into [openpyxl](https://openpyxl.readthedocs.io/en/stable/tutorial.html) – cssyphus Jul 01 '19 at 02:45
  • Have you seen https://stackoverflow.com/a/17071908/5386938 which says you can use something (untested) like: `df.loc[df['user_id'] == 'user1']` –  Jul 01 '19 at 02:53
  • Yes, it works using df.loc function, Thanks a lot – Mohsen Ali Jul 01 '19 at 03:04

0 Answers0