1

I am trying to clean my csv file. As doing so I had to copy rows for the products that had 'Quantity'>1. I did that like this:

file_in = file_in.loc[file_in.index.repeat(file_in.´Quantity)].reset_index(drop=True) 

But now I have another problem. In another column for some of my products I have SerialNumbers , which were ,at the begining all in one row added, and as I did the repeat they all got all of the Serial Numbers. Now I want to split those Serial Numbers so that each row has only one .

So from this:

   Quantity   SerialNumbers       ProductName
1   3           AB1 AB2 AB3            PR
2   3           AB1 AB2 AB3            PR
3   3           AB1 AB2 AB3            PR
4   1             CD                   monitor
5   4            NAN                   mouse
6   4            NAN                   mouse
7   4            NAN                   mouse
8   4            NAN                   mouse
9   4          EF1 EF2 EF3 EF4         ace
10   4          EF1 EF2 EF3 EF4         ace 
11   4          EF1 EF2 EF3 EF4         ace
12   4          EF1 EF2 EF3 EF4         ace
13   2            NAN                 screeen
14   2            NAN                 screeen
15   1            NAN                 flash
...

I need it to be like this:

   Quantity   SerialNumbers       ProductName
1   3           AB1                    PR
2   3           AB2                    PR
3   3           AB3                    PR
4   1           CD                   monitor
5   4           NAN                   mouse
6   4           NAN                   mouse
7   4           NAN                   mouse
8   4           NAN                   mouse
9   4           EF1                     ace
10   4           EF2                     ace 
11   4           EF3                     ace
12   4           EF4                     ace
13   2          NAN                 screeen
14   2          NAN                 screeen
15   1          NAN                 flash
..

I tried adding a new column where I would first create the list from my Serial Numbers and then by use of groupby function edit. I tried that using this , but it doesn't work.

file_in['Temp_Split'] = numpy.where((~file_in['Temp_Split'].isna()) & (file_in['Quantity']>1),file_in['Temp_Serial'].str.split() & file_in.groupby(['Position','Quantity'])['Temp_Split'].rank(method='first'), file_in['Temp_split'])

I searched for similar questions, but they are all for copying the rows while spliting the values. Maybe I should change the first line of code so that as it does the repeat it checks for the SerialNumbers and edit that also. But nor really sure how should I do that.

EDIT: I am not trying to copy rows while spliting the value in one column, I am trying to remove what is not needed for one row and add to the next one. As shown in example. There are still same amount of rows as at the begining

miwa_p
  • 435
  • 1
  • 4
  • 19
  • Possible duplicate of [Split (explode) pandas dataframe string entry to separate rows](https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows) – asongtoruin Apr 26 '19 at 10:18
  • @asongtoruin I looked at that question but it copies each row again. I don't need that – miwa_p Apr 26 '19 at 10:21
  • why not just revert back to your original file and do what has been done in the dupe though? i.e. use its suggestion for copying the rows rather than your own. – asongtoruin Apr 26 '19 at 10:25
  • Because it would only copy part of my rows. Because not each of my product has a SerialNumber – miwa_p Apr 26 '19 at 10:27
  • Could you not use `df.drop_duplicates(inplace=True)` first, then apply the solution from @asongtoruin suggestion? – Chris Adams Apr 26 '19 at 10:27
  • Why is not repeated row `5 4 NAN mouse` 4 times? And `10 8 NAN screeen` 8 times? – jezrael Apr 26 '19 at 10:28
  • Sorry guys I EDITED my columns. Now it should show how it is – miwa_p Apr 26 '19 at 10:36

2 Answers2

1

Try:

df.SerialNumbers.replace({'NAN':np.nan}, inplace=True)
df['count'] = df.groupby('SerialNumbers').SerialNumbers.cumcount()

f = (lambda x: x['SerialNumbers']
               if x['SerialNumbers'] is np.nan
               else x['SerialNumbers'][x['count']])

df['SerialNumbers'] = df.SerialNumbers.str.split(' ')
df['SerialNumbers'] = df.apply(lambda x: f(x), axis=1)

df.drop(columns='count')

    Quantity SerialNumbers ProductName
0          3           AB1          PR
1          3           AB2          PR
2          3           AB3          PR
3          1            CD     monitor
4          4           NaN       mouse
5          4           NaN       mouse
6          4           NaN       mouse
7          4           NaN       mouse
8          4           EF1         ace
9          4           EF2         ace
10         4           EF3         ace
11         4           EF4         ace
12         2           NaN     screeen
13         2           NaN     screeen
14         1           NaN       flash

I used Groupby.cumcount to find the position of each string then str.split and apply to select item in list.

ResidentSleeper
  • 2,385
  • 2
  • 10
  • 20
0

you could split and take the 1st one

import pandas as pd
df1 = pd.DataFrame({'SerialNumbers': {0: 'aaa aa aa', 1: 'as df', 2: 'fg 67', 3: 'as uy'},'ProductName':{0:80,1:12,2:44,3:61}})

df1['new_col'] = df1['SerialNumbers'].str.split(' ').str[0]
df1 = df1.loc[:,['ProductName','new_col']]
print(df1)
Zaynul Abadin Tuhin
  • 31,407
  • 5
  • 33
  • 63