4

I have a dataframe like this:

--------------------------------------------------------------------
Product        ProductType     SKU                Size
--------------------------------------------------------------------
T-shirt        Top            [111,222,333,444]   ['XS','S','M','L']
Pant(Flared)   Bottoms        [555,666]           ['M','L']
Sweater        Top            None                None

I want the below output:

Product       ProductType        SKU        Size
T-shirt       Top                111        XS
T-shirt       Top                222        S
T-shirt       Top                333        M
T-shirt       Top                444        L
Pant(Flared)  Bottoms            555        M
Pant(Flared)  Bottoms            666        L
Sweater       Top                None       None

I tried the following code:

s = df['SKU'].apply(Series,1).stack()
s.index = s.index.droplevel(-1)
s.name = 'SKU'
del df['SKU']
df = df.join(s)

r = df['Size'].apply(Series,1).stack()
r.index = r.index.droplevel(-1)
r.name = 'Size'
del df['Size']
df = df.join(r)

But this explodes into the following:

Product       ProductType   SKU             Size
T-shirt       Top           111             XS
T-shirt       Top           111             S
T-shirt       Top           111             M
T-shirt       Top           111             L
T-shirt       Top           222             XS
T-shirt       Top           222             S
T-shirt       Top           222             M
T-shirt       Top           222             L
T-shirt       Top           333             XS
T-shirt       Top           333             S
T-shirt       Top           333             M
T-shirt       Top           333             L
T-shirt       Top           444             XS
T-shirt       Top           444             S
T-shirt       Top           444             M
T-shirt       Top           444             L
Pant(Flared)  Bottoms       555             M
Pant(Flared)  Bottoms       555             L
Pant(Flared)  Bottoms       666             M
Pant(Flared)  Bottoms       666             L

Note that for simplicity sake, I have added two columns that will be repeated (Product, ProductType) but I have 5 such columns that contain strings. I basically want to associate the SKU with the size for each product.

Can anyone help here ?

coder1416
  • 121
  • 3
  • 11

1 Answers1

4

This is open to bugs so use with caution:

Convert Product column to a collection of lists whose sizes are the same with the lists in other columns (say, column SKU. This will not work if the lists in SKU and Size are of different lengths)

df["Product"] = df["Product"].map(list) * df["SKU"].map(len)

Out[184]: 
                    SKU           Size       Product
0  [111, 222, 333, 444]  [XS, S, M, L]  [a, a, a, a]
1            [555, 666]         [M, L]        [b, b]

Take the sum of the columns (it will extend the lists) and pass that to the dataframe constructor with to_dict():

pd.DataFrame(df.sum().to_dict())
Out[185]: 
  Product  SKU Size
0       a  111   XS
1       a  222    S
2       a  333    M
3       a  444    L
4       b  555    M
5       b  666    L

Edit:

For several columns, you can define the columns to be repeated:

cols_to_be_repeated = ["Product", "ProductType"]

Save the rows that has None values in another dataframe:

na_df = df[pd.isnull(df["SKU"])].copy()

Drop None's from the original dataframe:

df.dropna(inplace = True)

Iterate over those columns:

for col in cols_to_be_repeated:
    df[col] = df[col].map(lambda x: [x]) * df["SKU"].map(len)

And use the same approach:

pd.concat([pd.DataFrame(df.sum().to_dict()), na_df])

        Product ProductType    SKU  Size
0       T-shirt         Top  111.0    XS
1       T-shirt         Top  222.0     S
2       T-shirt         Top  333.0     M
3       T-shirt         Top  444.0     L
4  Pant(Flared)     Bottoms  555.0     M
5  Pant(Flared)     Bottoms  666.0     L
2       Sweater         Top    NaN  None

It might be better to work on a copy of the original dataframe.

ayhan
  • 70,170
  • 20
  • 182
  • 203
  • The size and the SKU will always be the same size. Also, what if I have multiple columns other than Product. For example: Product Type also which needs to be replicated along with Product. – coder1416 May 16 '16 at 16:33
  • Also.. tthe actually product is a string.. so when I am converting it to a list, It is splitting it as all the multiple characters in the string .. @ayhan – coder1416 May 16 '16 at 16:39
  • Lets say the data is like this: `Product Name t-shirt Product Type Tops` – coder1416 May 16 '16 at 16:40
  • @bjhav Can you edit the question to show the additional columns and sample product names? It will be easier to follow it on the question text rather than in the comments. – ayhan May 16 '16 at 16:44
  • This result is creating a list of lists for the repeated columns. It is not creating a string when outputting the result. – coder1416 May 16 '16 at 18:20
  • It is creating lots of lists of lists – coder1416 May 16 '16 at 18:26
  • It is supposed to create lists that contain repeated elements. In the final phase (`pd.DataFrame(df.sum().to_dict())`) you need to convert it to a dataframe. – ayhan May 16 '16 at 18:27
  • This works but the only issue I am having now is to deal with the cases where the SKU and the Size are None. – coder1416 May 16 '16 at 20:27
  • Again, can you show in the question how they are placed in the original dataframe and how do you want to handle them in the resulting dataframe. – ayhan May 16 '16 at 20:29
  • Edited the question. – coder1416 May 16 '16 at 20:46
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/112086/discussion-between-bjhav-and-ayhan). – coder1416 May 16 '16 at 21:01