3

I have a dataset where each row represents a unique sale. One column represents the items ordered in that sale, as CSV.

I need to separate the values in the "items ordered" column and for each item ordered, create a unique row, tied to the original order ID.

For example:

Order ID   Items Ordered
   127    Item 1, Item 2, Item 3

Should be:

Order ID   Items Ordered
   127         Item 1
   127         Item 2
   127         Item 3

I am relatively new to python/pandas. This code works, but is slow as I'm iterating over all the orders. What's a better way to do this?

temp_item_list = []

for row in raw_data.iterrows():
    for i in range(len(row['Items'])):
        temp_item_list.append((row['ID'], row['Items'][i]))

item_df = pd.DataFrame(temp_item_list)
ctd25
  • 730
  • 1
  • 11
  • 22

4 Answers4

4

To avoid iterating through all rows, you can use numpy.repeat to make columns Order ID while flattening the Items Ordered and make new data frame from this:

from itertools import chain
import numpy as np
import pandas as pd

# count the number of items in the items order column
items_count = df["Items Ordered"].str.count(",") + 1

# extend the Order ID column with numpy repeat and flatten Items Order column
pd.DataFrame({"Order ID": np.repeat(df["Order ID"], items_count),
              "Items Ordered": list(chain.from_iterable(df["Items Ordered"].str.split(",")))})

enter image description here

Psidom
  • 209,562
  • 33
  • 339
  • 356
3

With reference to similar answer you may try this step-by-step guide:

>>> initial_df = pandas.read_csv("your_file_path")
>>> final_df = pandas.concat([Series(row['Order ID'], row['Items Ordered'].split(',')) for _, row in initial_df.iterrows()]).reset_index()
>>> final_df
     index    0
0   Item 1  127
1   Item 2  127
2   Item 3  127
>>> final_df.columns= ['Items Ordered','Order ID']
>>> final_df
  Items Ordered  Order ID
0        Item 1       127
1        Item 2       127
2        Item 3       127
>>> final_df[['Order ID','Items Ordered']]
   Order ID Items Ordered
0       127        Item 1
1       127        Item 2
2       127        Item 3

This will get your job done.

Community
  • 1
  • 1
Shubham Namdeo
  • 1,845
  • 2
  • 24
  • 40
0

Something that might help you is pandas.DataFrame.apply. It allows you to apply a function to every row. You can define a function that grabs the items element in the row, splits it up, and calls the pandas.DataFrame.append to create a new row for each item.

Here is an example on how to use the apply function too.

Community
  • 1
  • 1
Chirag
  • 446
  • 2
  • 14
0

may this is what you are looking for;

df = pd.DataFrame({'order_id':127, 'items_ordered': ['item_1, item_2, item_3']})
df1 = pd.concat([df.order_id,df.items_ordered.str.split(',').apply(pd.Series)],axis=1,ignore_index=True)
df1 = df1.pivot(index = df1.index, columns = df1.columns[0]).stack().stack().to_frame()
df1 = df1.set_index(df1.index.get_level_values(level=1))

or alternatively a more succinct solution combining pivot and set_index steps above is:

df1 = pd.melt(df, id_vars = ['order_id'])
Siraj S.
  • 3,481
  • 3
  • 34
  • 48