1

I am working on a data set where the values in PLU column are scattered all over the place like: I have 4 columns out of more than 500:

Inventory_No | Description | Group | PLU
----------------------------------------------
93120007     | Coke        |Drinks | 1000
93120008     | Diet Coke   |Drinks | 1003
93120009     | Coke Zero   |Drinks | 1104
93120010     | Fanta       |Drinks | 1105

93120011     | White Bread |Bread  | 93120011     
93120012     | whole Meal  |Bread  | 93120012     
93120013     | Whole Grains|Bread  | 110011
93120014     | Flat white  |Breads | 1115092

I want my output to be like below that if there is any value with the length more than 6 digits in PLU column, system checks the next available number in sequence of PLU with less than 4 digits in length and add an increment of 1 in it and assign the PLU value to that row and does not change any existing less than 6 digits PLU's:

Inventory_No | Description | Group | PLU
----------------------------------------------
93120007     | Coke        |Drinks | 1000
93120011     | White Bread |Bread  | 1001
93120012     | whole Meal  |Bread  | 1002
93120008     | Diet Coke   |Drinks | 1003
93120014     | Flat white  |Breads | 1004
   .         |     .       |  .    |   .
   .         |     .       |  .    |   .
   .         |     .       |  .    |   .
93120009     | Coke Zero   |Drinks | 1104
93120010     | Fanta       |Drinks | 1105
93120013     | Whole Grains|Bread  | 110011

I want the next available value in the sequence with less than 6 digits and increment it by 1 and if it finds a sequence for any number of incremental values then skip the sequence and start with next available value after the sequence as long as the sequence is less than 6 digits long:
I have checked the below links and they are towards filling the sequence with 0 or Nan values
fill-in-a-missing-values-in-range-with-pandas
missing-data-insert-rows-in-pandas-and-fill-with-nan

Thanks in advance for your answers. Regards,

user3483203
  • 50,081
  • 9
  • 65
  • 94
Baig
  • 469
  • 2
  • 7
  • 19
  • Is it not enough to iterate through all the values in `PLU` and assign an incremented value from the row before? And I guess that you implicitly assume that your assigned numbers will not grow over 6 digits when they could collide with the numbers that are assigned from the outside? – Krzysztof Słowiński Aug 19 '18 at 15:44
  • 1
    Since you'll be changing some PLUs anyway... can you not just re-write the lot? If it needs to be consistent to reference something else that's fine but that's not going to be the case for where you change them... so... it seems if you're happy to do that, you should just give them all brand new codes starting from 1000 – Jon Clements Aug 19 '18 at 16:04

2 Answers2

3

Setup

print(df)

   Inventory_No   Description   Group       PLU
0      93120007          Coke  Drinks      1000
1      93120008     Diet Coke  Drinks      1003
2      93120009     Coke Zero  Drinks      1104
3      93120010         Fanta  Drinks      1105
4      93120011   White Bread   Bread  93120011
5      93120012    whole Meal   Bread  93120012
6      93120013  Whole Grains   Bread    110011
7      93120014    Flat white  Breads   1115092

First, let's create a list of values that we can use to fill that aren't contained in df.PLU:

fillers = [
    i for i in np.arange(df.PLU.min(), df.PLU.min() + len(df)) if i not in set(df.PLU)
]
# [1001, 1002, 1004, 1005, 1006, 1007]

Now we can make a series with our new values and fill:

condition = df.PLU.ge(1e6)
s = df.loc[condition]
fill = pd.Series(fillers[len(s):], index=s.index)
df.assign(PLU=df.PLU.mask(condition).fillna(fill).astype(int)).sort_values('PLU')

Output:

   Inventory_No   Description   Group     PLU
0      93120007          Coke  Drinks    1000
4      93120011   White Bread   Bread    1001
5      93120012    whole Meal   Bread    1002
1      93120008     Diet Coke  Drinks    1003
7      93120014    Flat white  Breads    1004
2      93120009     Coke Zero  Drinks    1104
3      93120010         Fanta  Drinks    1105
6      93120013  Whole Grains   Bread  110011
user3483203
  • 50,081
  • 9
  • 65
  • 94
1

Example DataFrame:

df = pd.DataFrame({'PLU': ['1001', '1002', '1110679', '1003', '1005', '12345', '1234567', '1231231231312', '1003', '1110679']}

Get the next unused 4 digit number:

start_at = int(df['PLU'][df.PLU.str.len() == 4].max()) + 1

Build an iterable from the starting number until 10000 (so the range is to 9999 at most - eg: only 4 digits):

spare_code = iter(range(start_at, 10000))

Where the length of the PLU is more than 6 characters, replace with the next spare code...

to_replace = df['PLU'].str.len() > 6
df.loc[to_replace, 'PLU'] = df.PLU[to_replace].map(lambda v: str(next(spare_code)))

Gives you an amended df of:

     PLU
0   1001
1   1002
2   1006
3   1003
4   1005
5  12345
6   1007
7   1008
8   1003
9   1009
Jon Clements
  • 138,671
  • 33
  • 247
  • 280