
I want to create a sample dataframe, based on a JSON template, that looks as real as possible. Hence a normal distribution.

This is what I have tried:

import json, random
import pandas as pd

sample_data = """{"product1":[
    {"category":"Vegetables",
    "productlist":["Bell Peppers","Red Chillies", "Onions", "Tomatoes"]}
],
"product2":[
    {"category":"Fruits",
    "productlist":["Apple","Mango","Banana"]}
]}"""

products = json.loads(sample_data)

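# One column per category, in the order the categories appear in the JSON.
colHeaders = [v[0]['category'] for v in products.values()]

# Build all rows first; appending to a DataFrame one row at a time is slow.
rows = []
for i in range(1000):
    itemlist = []
    for k, v in products.items():
        # random.choice picks uniformly, so every product is equally likely.
        itemlist.append(random.choice(v[0]['productlist']))
    rows.append(itemlist)

df = pd.DataFrame(rows, columns=colHeaders)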

print(df)

I am not sure I am doing it correctly. If not, please help me with

  • How to check if the data frame rows represent a normal distribution?
  • How to try other distributions in this case?

Other related Stack Overflow questions I have referred are:

kingmakerking

1 Answer

I think you should generate normally distributed integers and use them as indices into the list. In my opinion, plotting the numbers you generated is the best way to check whether they follow a normal distribution: the plot should resemble the bell shape. However, since 20 is such a small number, it may not match the desired shape exactly, which is something to keep in mind. I think the following link has all the information you need.

How to generate a random normal distribution of integers
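The idea of normally distributed integer indices can be sketched roughly as follows. This is only one possible reading of the suggestion, not the method from the linked question: the choice of NumPy, the mean at the middle index, the `scale` value, the clipping step, and the fixed seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

productlist = ["Apple", "Mango", "Banana"]

rng = np.random.default_rng(seed=0)  # fixed seed (assumption) so runs repeat

# Draw from a normal distribution centred on the middle index, then round
# and clip so that every value is a valid index into productlist.
n = len(productlist)
raw = rng.normal(loc=(n - 1) / 2, scale=n / 4, size=1000)
indices = np.clip(np.rint(raw), 0, n - 1).astype(int)

column = pd.Series([productlist[i] for i in indices], name="Fruits")

# Frequency table: the item in the middle of the list should dominate.
counts = column.value_counts()
print(counts)
```

If matplotlib is installed, `counts.reindex(productlist).plot(kind="bar")` gives a quick visual check of the shape. Note that the "bell" only says something about the order of the list, which is arbitrary for product names.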

Lara Ipek
  • `productlist = ["Apple","Mango","Banana"]` can be treated as `productlist = [0,1,2]`, but wouldn't the rest of the logic still remain the same? – kingmakerking Jan 11 '21 at 18:22
  • Not sure what you mean here, but I think your way works just fine; I'm just not sure if it would generate a random distribution. If you do decide to use a different random generator, your code would change like this: `for k,v in products.items(): itemlist.append(v[0]['productlist'][random_integer])`. The only problem here is that it generates random integers for the two lists separately, meaning you have two rounded distributions over the ranges (0,3) and (0,4), if that is indeed what you wanted. – Lara Ipek Jan 11 '21 at 18:45
  • From what you are suggesting, the random_integer will be normally distributed, but not the values in the product list. The idea is for the product lists appended to the dataframe (as rows) to look like real occurrences. – kingmakerking Jan 11 '21 at 18:48
  • well since your product list has items in it, there is no real way to have a normal distribution between them. The thing I'm describing is only useful if the lists are ordered. From what I can tell, any random function will do what you want, especially since the example lists are so small anyway. – Lara Ipek Jan 11 '21 at 18:50
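For unordered items like these, one way to try a non-uniform distribution is to attach a weight to each item and sample with `random.choices` from the standard library. This is a minimal sketch; the `weights` values are hypothetical and not taken from any real sales data.

```python
import random
from collections import Counter

productlist = ["Bell Peppers", "Red Chillies", "Onions", "Tomatoes"]

# Hypothetical relative frequencies, invented purely for illustration.
weights = [1, 1, 5, 3]

random.seed(0)  # fixed seed (assumption) so the run is repeatable

# Draw 1000 items; each pick is proportional to the item's weight.
sample = random.choices(productlist, weights=weights, k=1000)

# Count how often each product was drawn.
counts = Counter(sample)
print(counts.most_common())
```

Changing `weights` changes the shape of the categorical distribution; equal weights reproduce the behaviour of `random.choice`.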