
The program creates some random products and then creates orders by randomly choosing a product. Right now every order only has one item; a future version will randomize the number of line items per order.

I've never used Python or Pandas before, and I wanted to make sure that my approach is the most efficient way to add a new row to a DataFrame and to select a random row from one.

Any suggestions?

Thank you

import random
import pandas as pd

def get_random_products(count=500):
    x = 0
    df = pd.DataFrame(columns=['product_id', 'SKU', 'price', 'category', 'size', 'color', 'style', 'gender'])
    while x < count:
        row = pd.DataFrame([[x
                      ,get_random_SKU()
                      ,get_price()
                      ,get_category()
                      ,get_size()
                      ,get_color()
                      ,get_style()
                      ,get_gender()]]
                      ,columns=['product_id', 'SKU', 'price', 'category', 'size', 'color', 'style', 'gender'])
        df = df.append(row
                      ,ignore_index=True)
        x += 1
    return df


#---

def get_random_orders(products, count=1000, start_order_id=1, number_of_customers=500):
    # CustomerID    OrderID    OrderDate    SKU    Price    Category    Size    Color    Style    Gender
    x = 0
    df = pd.DataFrame(columns=['customer_id', 'order_id', 'order_date', 'SKU', 'price', 'category', 'size', 'color', 'style', 'gender'])
    while x < count:
        # Each time through, choose a random product to be in the order
        p = products.to_records()[random.randint(0, len(products)-1)]
        row = pd.DataFrame([[get_customer_id(number_of_customers)
                      ,x+1
                      ,get_order_date()
                      ,p['SKU']
                      ,p['price']
                      ,p['category']
                      ,p['size']
                      ,p['color']
                      ,p['style']
                      ,p['gender']]]
                      ,columns=['customer_id', 'order_id', 'order_date', 'SKU', 'price', 'category', 'size', 'color', 'style', 'gender'])
        df = df.append(row
                      ,ignore_index=True)
        x += 1
    return df

# Main code here

catalog = get_random_products(1000)
orders = get_random_orders(catalog, 1000, 1, 500)
    My advice is: if you have never used Python and Pandas before, focus on making your code easy to read and understand for you and others rather than optimizing it. It is only in a second step (if performance requires it) that you would refactor the code. – Corralien Apr 28 '21 at 16:52

1 Answer


For an efficient approach:

My suggestion dives a bit into rules for normalization for databases. The general idea of these rules is to reduce data redundancy (Why enter the same data more than once?). That said, the information is helpful in this scenario and will prepare your code for your end-goal of multiple line-items per order.

Luckily, customer/product/order databases are a textbook example for these rules. The typical recommendation is to have one table for each type of "entity" (customers, products, orders, line items), with the columns in each table pertaining only to that entity. Where a table relates one entity to another, it holds just an identifier column for the foreign entity (e.g. 'customer_id' or 'SKU').

So, for your question, my initial setup would be the following:

# Customers DF, no foreign entities
c_df = pd.DataFrame(columns=['customer_id', 'name', 'gender'])

# Products DF, no foreign entities
p_df = pd.DataFrame(columns=['SKU', 'price', 'category', 'size', 'color', 'style'])

# Orders DF, with 'customer_id' being an identifier to tie into c_df
o_df = pd.DataFrame(columns=['order_id', 'order_date', 'customer_id']) 

# Line-Items DF, with both order_id and SKU being foreign identities.
li_df = pd.DataFrame(columns=['order_id', 'SKU', 'quantity']) 

Once things are set up, I would generate the entities for each DF separately.

def _gen_customers(df, num=1):
    new_customers = []
    for i in range(num):
        new_customers.append({
            'customer_id': i,  # Not sure how you want these generated; 'i' works if the ID is arbitrary
            'name': SOME_NAME,  # I made this assuming you want to name the customers. Could just be left out if unnecessary.
            'gender': random.choice(['m', 'f'])  # https://docs.python.org/3/library/random.html#random.choice
        })
    new_df = df.append(new_customers, ignore_index=True)  # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
    return new_df

c_df = _gen_customers(c_df, 500)

You can do similar functions for the other DFs. To choose foreign identifiers randomly, you can set up arrays of unique values to choose from like so:

all_customers = c_df['customer_id'].unique() # https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html?highlight=unique#pandas.Series.unique

def _pick_customers(num):
    """ Returns a list of customer_ids of length 'num' (sampled with replacement). """
    return random.choices(all_customers, k=num)
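
For example (illustrative numbers only), you could draw all of the customer IDs for a batch of new orders in one call:

# 1000 random customer_ids, sampled with replacement, for the next batch of orders
order_customer_ids = _pick_customers(1000)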

For line items, just make one row for each SKU per order, as sketched below. You can start with one line per order, but it is expandable to as many as you would like.
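
A rough sketch along those lines (the all_skus array, the quantity range, and the assumption that o_df has already been filled by a similar _gen_orders helper are my own illustrations, not from your code):

all_skus = p_df['SKU'].unique()

def _gen_line_items(df, order_ids):
    """ One line item per order to start; append more dicts per order to expand later. """
    new_items = []
    for order_id in order_ids:
        new_items.append({
            'order_id': order_id,
            'SKU': random.choice(all_skus),    # pick a random product for this line
            'quantity': random.randint(1, 5),  # arbitrary quantity range for illustration
        })
    return df.append(new_items, ignore_index=True)

li_df = _gen_line_items(li_df, o_df['order_id'])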

Then, typically, you would make the columns that will be searched by most often the "index", to speed up lookups (see pandas.DataFrame.set_index):

c_df = c_df.set_index(keys='customer_id', drop=True)
p_df = p_df.set_index(keys='SKU', drop=True)
o_df = o_df.set_index(keys='order_id', drop=True)
li_df = li_df.set_index('order_id', drop=True)
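
With the indexes in place, the "select a random row" part of your question becomes a simple label lookup (this reuses the all_skus array from the sketch above, which is my own illustration):

# fetch one product's attributes by its SKU label
some_sku = random.choice(all_skus)
product = p_df.loc[some_sku]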

You may then merge the DFs as applicable to whatever your scenario is.
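
For example, here is a minimal sketch of rebuilding a flat, denormalized view like your original orders DataFrame (assuming the set_index calls above have been applied):

# li_df and o_df share the order_id index, so a plain join lines them up;
# the other two joins look a column up against the foreign table's index
flat = (li_df
        .join(o_df)                    # adds order_date and customer_id
        .join(c_df, on='customer_id')  # adds the customer attributes
        .join(p_df, on='SKU'))         # adds the product attributes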

Jake Steele
  • Thanks Jake. I need the denormalized data set so that's what I started with. But DF probably has a way to JOIN. Your sample appends to new_customer[] and adds that all at once to the DF. I'll give those a try. – Scott Powell Apr 29 '21 at 22:09
  • Understandable, @ScottPowell. If you needed to, you could always use a merge and save as a new/unnormalized DF for whatever use case scenario requires it. As for my reasoning for appending to ```new_customer[]``` before adding to DF all at once: typically people suggest not to add rows or data one at a time if you can avoid it. Since this would be a batch operation either way, it's usually safer to add all at the end. Good summary on that [here](https://stackoverflow.com/questions/13784192/creating-an-empty-pandas-dataframe-then-filling-it/56746204#56746204). – Jake Steele Apr 30 '21 at 20:00
  • Thanks Jake. That's EXACTLY what I was worried about when I wrote the question and I really appreciate the guidance. – Scott Powell Apr 30 '21 at 21:51
  • Any time! Did my answer work for you? If not, would you like for me to give it another shot focusing on keeping the data un-normalized? – Jake Steele May 04 '21 at 16:57
  • Hi Jake. It did help. The next bottleneck I found is retrieving a random product from the products DF for the order line. Fetching by index is slower than I thought. Generating 1 Mil orders of up to 5 lines from a 1 Mil product catalog takes 4 1/2 minutes. Generating the same data set from a static product is trivial time. I'll try normalizing the DFs to see if a single JOIN at the end is faster, but any thoughts on iloc's speed? ```random_product_id = random.randint(0, len(products) - 1); p = products.iloc[random_product_id]``` – Scott Powell May 05 '21 at 19:02
  • I believe generating fewer columns should speed things up significantly, but as for random row selection, I have accomplished a similar thing (on a smaller scale) with the following code to select up to 600 indices from a DF of 1000 rows: ```rng = np.random.default_rng(); i_array = rng.choice(len(ntc_df), size=num_controls, replace=False); sliced = ntc_df.iloc[i_array]``` To ensure the same product could be chosen multiple times, simply change ```replace=False``` to ```replace=True```. – Jake Steele May 05 '21 at 22:25