-2

Situation:

1. all_task_usage_10_19

all_task_usage_10_19 is the file which consists of 29229472 rows × 20 columns. There are multiple rows with the same ID inside the column machine_ID with different values in other columns.

Columns:

'start_time_of_the_measurement_period','end_time_of_the_measurement_period', 'job_ID', 'task_index','machine_ID', 'mean_CPU_usage_rate','canonical_memory_usage', 'assigned_memory_usage','unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage','mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage','maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'

2. clustering code

I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas


3. Output

Output displayed using: with option_context as it allows to better visualise the content


My Aim:

I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.

Something like this.

Ishaan_B
  • 9
  • 6
  • 3
    Don't put your code/data in images. Type the code out in a code block and add a means of reproducing it (code that creates a small sample of your data) or at least a table of your data typed out (e.g. markdown format) – Michael S. Aug 12 '22 at 21:57
  • What happened when you ran your code, what was the expected output, what do you want to do better? – Michael S. Aug 12 '22 at 22:09
  • What operation do you want to perform on multiple values of the same column (with the same `machine_ID`) to combine them into one value? – Vladimir Fokow Aug 12 '22 at 22:11
  • Dear @MichaelS. Thank you for your insights on how to post questions, I would certainly keep this in mind. When I run my code, I get the output: "3. Output" of my question (I have attached a picture) – Ishaan_B Aug 12 '22 at 22:19
  • Dear @VladimirFokow, I want to generate a single row using machine_ID i.e. for eg. If I have multiple rows with the machine_ID "8910" with different values in all the other columns, I would like to cluster all these rows into a single record by creating multiple sub rows on the right inside the machine_ID "8910" on the left. I have added the "My aim" section in my question for a better understanding. – Ishaan_B Aug 12 '22 at 22:41
  • Does this answer your question? [Create multiindex from existing dataframe](https://stackoverflow.com/questions/44442831/create-multiindex-from-existing-dataframe) – Vladimir Fokow Aug 12 '22 at 23:19

2 Answers2

0

Maybe a Multi-Index is what you're looking for?

df.set_index(['machine_ID', df.index])

Note that by default set_index returns a new dataframe, and does not change the original. To change the original (and return None) you can pass an argument inplace=True.

Example:

df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index])  # not in-place
df.set_index(['machine_ID', df.index], inplace=True)  # in-place

For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:

Vladimir Fokow
  • 3,728
  • 2
  • 5
  • 27
  • The output expected by the user in the reference link provided is exactly what I am looking for. In the above case (reference link question) clustering multiple records into a single record based on 'user_id' is exactly what I am aiming to implement with 'machine_ID' in my case. – Ishaan_B Aug 12 '22 at 23:15
  • The solution: df.set_index(['machine_ID', df.index]) still fails to achieve what I am aiming for with my data frame. – Ishaan_B Aug 12 '22 at 23:17
  • @Ishaan_B, Why does it fail? – Vladimir Fokow Aug 12 '22 at 23:20
  • When I implement the following, I get a data frame with no change but 'machine_id' column shifted slightly below the rest of the columns : `all_task_usage_10_19_clustering_through_machine_id.set_index(['machine_ID', all_task_usage_10_19_clustering_through_machine_id.index])` – Ishaan_B Aug 12 '22 at 23:25
  • Also, I just observed in the (reference link question) that the user expects the columns to be divided into levels i.e. 'user_id' as level 0, 'account_num' as level_1 and so on and so forth but I am expecting a data frame where 'machine_id' is level 0 and everything else is level 1. – Ishaan_B Aug 12 '22 at 23:31
  • Vladimir Fokow, I just checked the above solution does not cluster the rows based on machine_id, I think I need to do something with the groupby() – Ishaan_B Aug 13 '22 at 15:48
  • Dear Vladimir Fokow, pardon me for my miscommunication, what you have proposed does divide the data frame into levels but I must've failed to communicate what I am aiming for i.e. creating a single row from multiple rows sharing the same **machine_id** in a single record (one row). – Ishaan_B Aug 13 '22 at 17:11
0

The below code worked for me:

all_task_usage_10_19.groupby('machine_ID')[['start_time_of_the_measurement_period','end_time_of_the_measurement_period','job_ID', 'task_index','mean_CPU_usage_rate', 'canonical_memory_usage',
    'assigned_memory_usage', 'unmapped_page_cache_memory_usage',           'total_page_cache_memory_usage', 'maximum_memory_usage',
    'mean_disk_I/O_time', 'mean_local_disk_space_used','maximum_CPU_usage',
    'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
    'memory_accesses_per_instruction_(MAI)', 'sample_portion',
    'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()
Ishaan_B
  • 9
  • 6