
I have a CSV file, for example:

    ID,Name,products
    101,Tesco,Apple;Banana;Oranges
    102,Lidl,Juice;Yogurt
    103,Aldi,Fruits;vegetables;rice

Using the pandas library, I want to split these into a new CSV such that the products column contains only one value per field.

The following code I've tried only selects those particular columns and writes the values to a new CSV. I want to modify the output.

    import pandas as pd

    df = pd.read_csv('final.csv', delimiter=",", index_col="ID", encoding="ISO-8859-1")
    df1 = df[["Name", "products"]]  # double brackets select a list of columns
    df1.to_csv('a.csv', header=True)
    print(df1)

I want the new CSV to have the values in the following format:

    101,Tesco,Apple
    101,Tesco,Banana
    101,Tesco,Oranges
    102,Lidl,Juice
    102,Lidl,Yogurt
    103,Aldi,Fruits
    ...

  • Possible duplicate of [How do I unnest (explode) a column in a pandas DataFrame?](https://stackoverflow.com/questions/53218931/how-do-i-unnest-explode-a-column-in-a-pandas-dataframe) – ALollz Feb 17 '19 at 22:16
  • There's probably a way to do it without creating the list column, but using Wen-Ben's `unnesting` function it would be `unnesting(df.assign(products = df.products.str.split(';')), ['products'])` – ALollz Feb 17 '19 at 22:16
  • this is a little bit different because the desired output is actually extra rows not columns – gold_cy Feb 17 '19 at 22:18
  • Im not creating extra columns, just that most typical unnesting questions have the input as a list of items, not a delimited string. so to work with most solutions need to first transform `'Apple;Banana;Oranges'` to `['Apples', 'Banana', 'Oranges']`. The result is each element split to a new row. There's perhaps a dup that deals with unnesting strings directly, but using `.str.split` is trivial enough to get it in the desired format. – ALollz Feb 17 '19 at 22:29

1 Answer


We can:

1) Split the items in each row on the ";" delimiter with `expand=True`, giving one column per item.

2) Unstack those columns into rows, then drop the extra index level that the unstack adds (the former column positions).

3) Name the resulting pd.Series and join it back onto the main df.

    prod = (df['products']
            .str.split(";", expand=True)       # one column per item
            .unstack()                         # columns -> rows
            .reset_index(level=0, drop=True)   # drop the former column-position level
            .dropna())

    ID
    101         Apple
    102         Juice
    103        Fruits
    101        Banana
    102        Yogurt
    103    vegetables
    101       Oranges
    103          rice

    prod.name = 'product'

    df = df.join(prod.to_frame())
    df.drop('products', axis=1, inplace=True)

Output:

      Name     product
ID                    
101  Tesco       Apple
101  Tesco      Banana
101  Tesco     Oranges
102   Lidl       Juice
102   Lidl      Yogurt
103   Aldi      Fruits
103   Aldi  vegetables
103   Aldi        rice
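
As a side note, newer pandas versions (0.25 and later, released after this question was asked) provide `DataFrame.explode`, which collapses the split/unstack/join dance into one chain. A minimal sketch, assuming the same data as in the question:

```python
import pandas as pd

# Sample data matching the question's CSV
df = pd.DataFrame({
    'ID': [101, 102, 103],
    'Name': ['Tesco', 'Lidl', 'Aldi'],
    'products': ['Apple;Banana;Oranges', 'Juice;Yogurt', 'Fruits;vegetables;rice'],
}).set_index('ID')

# Split the delimited string into a list, then explode one element per row
out = (df.assign(product=df['products'].str.split(';'))
         .explode('product')
         .drop(columns='products'))
print(out)
```

`out.to_csv('a.csv')` then writes the one-product-per-row CSV directly.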
user3212593
  • Thanks, great solution. I have a question: what if there is some redundancy in certain rows, like `1,Tesco,Apple,Banana,Apple,oranges`? In that case I would only want one of the "Apple"s, i.e. unique values like 1 Tesco Apple / 1 Tesco Banana / 1 Tesco oranges. What modifications would be needed then? – Rafeh Ishtiaq Feb 21 '19 at 17:17
  • If you want unique product values for each ID, you can group by ID and then drop_duplicates. So, on the prod series: `prod.groupby('ID', group_keys=False).apply(lambda x: x.drop_duplicates())` – user3212593 Feb 21 '19 at 17:31
  • I tried it out and ran the code, but it didn't seem to work – Rafeh Ishtiaq Feb 21 '19 at 22:57
  • what was the problem? – user3212593 Feb 22 '19 at 00:45
  • It still didn't remove the duplicates, could you kindly show the updated code above in your answer – Rafeh Ishtiaq Feb 22 '19 at 00:53
  • `prod.groupby('ID', group_keys=False).apply(lambda x: x.drop_duplicates())` outputs a new pd.Series; it does not change prod. Did you set `prod = prod.groupby('ID', group_keys=False).apply(lambda x: x.drop_duplicates())`? – user3212593 Feb 22 '19 at 00:57