
I have an original dataframe with 4 columns (for the example, let's call them product_id, year_month, week, order_amount) and more than 50,000 rows. There are 240 distinct product_id values, and each one behaves differently in the data, so I wanted to split the original dataframe into individual dataframes, one per product_id. I was able to do this with:

dict_of_productid = {k: v for k, v in df.groupby('product_id')}

This created a dictionary whose keys are the product_id values and whose values are DataFrames with the columns product_id, year_month, week, order_amount. Each DataFrame also keeps the index from the original df. For example, if product_id dvvd56 was on row 4035 of the original df, that row appears in the DataFrame for dvvd56 with its index still 4035.
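To make the index behaviour concrete, here is a minimal sketch. The four-row frame is made-up sample data (only the column names and the dvvd56 ID come from the question; abc123 is a placeholder), and `dict_reset` is just an illustrative name for a variant that renumbers each group's rows from 0.

```python
import pandas as pd

# Hypothetical sample data standing in for the real 50,000-row frame
df = pd.DataFrame({
    "product_id": ["dvvd56", "abc123", "dvvd56", "abc123"],
    "year_month": ["2020-01", "2020-01", "2020-02", "2020-02"],
    "week": [1, 1, 5, 5],
    "order_amount": [10, 20, 30, 40],
})

# Each value is already a full DataFrame; the original row labels are kept
dict_of_productid = {k: v for k, v in df.groupby("product_id")}
original_index = list(dict_of_productid["dvvd56"].index)

# Optional variant: renumber each group's rows starting from 0
dict_reset = {k: v.reset_index(drop=True) for k, v in df.groupby("product_id")}
reset_index = list(dict_reset["dvvd56"].index)
```

Whether to keep or reset the index depends on whether you later need to map rows back to the original df.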

What I'm stuck with now is a dictionary with DataFrames as values, and I can't find a way to turn these values into individual DataFrames that I can use and manipulate. If there is a way to do this, please let me know! I'd be very grateful. Thank you.

  • Have you tried: `dict_of_productid = {k: v.reset_index(drop=True) for k, v in df.groupby('product_id')}`? – Michael Jacob Mathew Apr 23 '20 at 15:29
  • I'm not sure I understand - the `dict` values are already `DataFrame` objects. What are you struggling with? Also what exactly are you trying to accomplish here, because I'm pretty sure this is not the best way to go about it. If you mean you want these named as `df1`, `df2`, `df3` then it's probably best you just stick with accessing them with `dict_of_productid['dvvd56']` etc. – r.ook Apr 23 '20 at 15:59
  • There are 240 individual product_id values. I could just access each one like you said with `dict_of_productid['dvvd56']`, but I would have to do that for all 240 of them. I was asking if there is a simpler way that doesn't require me to write code explicitly for every single one. – MiguelAChevres Apr 23 '20 at 16:22
  • Let's put it this way. What are your intents for these individual `DataFrame`s? Did you want to manipulate *copies* of these (i.e. original `df` is untouched), or did you want to propagate the change in the original `df` itself? In the former case, you can't really get away from having separate references, either as individual names or as `dict` values. In the latter case, what are you trying to do with these groups? It might be achievable without you needing to separate them in the first place. – r.ook Apr 23 '20 at 16:41
  • If you have some sort of identical process for each of these 240 `product_id` values then you don't need to care about the individual frames; you can operate directly on the `df` itself with built-in functions, or use `df.apply` if you need to cater to conditions. – r.ook Apr 23 '20 at 17:03
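As a rough illustration of the last two comments, the sketch below runs a per-product calculation directly on the full frame, without splitting it first. The sample data, the `totals` name and the `share_of_product_total` column are all made up for the example; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data using the question's column names
df = pd.DataFrame({
    "product_id": ["dvvd56", "abc123", "dvvd56", "abc123"],
    "year_month": ["2020-01", "2020-01", "2020-02", "2020-02"],
    "week": [1, 1, 5, 5],
    "order_amount": [10, 20, 30, 40],
})

# One summary row per product, computed in a single call
totals = df.groupby("product_id")["order_amount"].sum()

# transform() returns a result aligned with the original rows, so a
# per-product value can be written straight back into df
product_total = df.groupby("product_id")["order_amount"].transform("sum")
df["share_of_product_total"] = df["order_amount"] / product_total
```

If every product gets the same treatment, this kind of grouped operation usually avoids the need for 240 separate frames.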

1 Answer


I found a way to go about this. I don't know if it is the most appropriate way, but it might help clarify what I want to do for anyone posting further answers.

The first step was to collect the unique product_id values into a list and sort them:

# Unique product IDs (value_counts skips NaN), sorted
product_id_list = sorted(df['product_id'].value_counts().index.to_list())

After this was done, I created a function and then called it in a loop with each value of product_id_list:

def get_df(key):
    # Look up the DataFrame stored for a single product_id
    return dict_of_productid[key]

# Bind each product's DataFrame to a global name: df_0, df_1, ...
for c, i in enumerate(product_id_list):
    globals()[f'df_{c}'] = get_df(i)

This now lets me pull every value of the dictionary out into its own DataFrame, which I can call without explicitly stating the product ID. I can just type df_1 and get a DataFrame (numbering starts at df_0 and follows the sorted product_id_list).

(I don't know if this is the most efficient way to go about this.)

  • This would work but I would not recommend manipulating the `globals()` directly like this. [This is a relevant thread with multiple solutions for what you're doing here](https://stackoverflow.com/questions/1373164/how-do-i-create-a-variable-number-of-variables). I would recommend using the `dict` that you have since it's easy to maintain and reference back. `dict_of_productid['dvvd56']` is much more understandable than `df_1`, `df_2`, etc, and you also maintain the ability to apply the same function by iterating the `dict` instead of `func(df_1)`, `func(df_2)`, etc. – r.ook Apr 23 '20 at 17:01
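A minimal sketch of what iterating the `dict` could look like, as an alternative to writing into `globals()`. The sample data and the `summarize` function are hypothetical stand-ins for whatever per-product processing is actually needed.

```python
import pandas as pd

# Hypothetical sample data using the question's column names
df = pd.DataFrame({
    "product_id": ["dvvd56", "abc123", "dvvd56", "abc123"],
    "year_month": ["2020-01", "2020-01", "2020-02", "2020-02"],
    "week": [1, 1, 5, 5],
    "order_amount": [10, 20, 30, 40],
})
dict_of_productid = {k: v for k, v in df.groupby("product_id")}

def summarize(frame):
    """Hypothetical per-product step: total and mean order amount."""
    return {
        "total": frame["order_amount"].sum(),
        "mean": frame["order_amount"].mean(),
    }

# Apply the same function to all products without naming any of them
results = {pid: summarize(frame) for pid, frame in dict_of_productid.items()}

# Take an explicit copy before modifying one product's frame,
# so the original df is left untouched
dvvd56 = dict_of_productid["dvvd56"].copy()
dvvd56["order_amount"] *= 2
```

The loop scales to all 240 products unchanged, and the keys keep each result tied to a readable product ID instead of a positional name like `df_1`.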