
My goal is to combine two columns into a third column, 'Priority' (step 1). Next, I count each instance of the combined values in the new 'Priority' column (step 2). I then filter out instances where the combined value, i.e. 'Priority', appeared only once (step 3). Next, I delete each row that has 'Cancelled' in the 'WO_Stat' column if the combined-value count from step 2 is greater than 1 (step 4).

I believe I have done the previous steps correctly; in my code I have commented where I got lost: "#above, this was working fine 9.24 but not sure if it makes sense, also need to work on below".

I need the most help with the step below.

Step 5: For the values in 'Priority' that had a count greater than 1, delete the rows where 'Order_Qty' is less than 16, but only in the event that the same 'Priority' value also has another row with an 'Order_Qty' greater than 99. (Mind that there may be up to 10 counts of each 'Priority' value, so if the Order_Qty values were 10, 10, 9, 8, 2000, 2000, 2000, 4000, 3000, 300, you would delete only the first four.)
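In case it helps, here is a minimal sketch of that step-5 rule using groupby/transform instead of row loops (toy data; the column names 'Priority' and 'Order_Qty' are taken from the description):

```python
import pandas as pd

# Toy data: group 'A' mixes small and large quantities; group 'B' has only small ones
df = pd.DataFrame({
    'Priority':  ['A', 'A', 'A', 'B', 'B'],
    'Order_Qty': [10, 9, 2000, 5, 8],
})

# Flag every row whose 'Priority' group contains an Order_Qty over 99
has_big = df.groupby('Priority')['Order_Qty'].transform('max') > 99

# Drop rows under 16 only when their group also has an over-99 order
df = df[~((df['Order_Qty'] < 16) & has_big)]
# Remaining Order_Qty values: 2000 (from 'A'), and 5 and 8 (all of 'B' survives)
```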

If you cannot help with the logic, even help making this code run faster would be appreciated; it takes almost an hour on 40k rows of data. Maybe I could use dynamic programming, or set the column data types better?

import pandas as pd
import numpy as np
from numpy import NaN
df = pd.read_excel("ors_final.xlsx", dtype=object) # note: read_excel has no encoding option; file was named vlookuped but changed to ors_final as of 2.20.19
df['Priority']= df['Priority'].astype('str')
df['Cust_PO_Number']= df['Cust_PO_Number'].astype('str')
df['Item_Number']= df['Item_Number'].astype('str')
df['Sub_Priority']= df['Sub_Priority'].astype('str')
# creating second df
df_deleted = df.copy(deep=True)
df_deleted.drop(df.index,inplace=True)
# creating variable for the small-quantity (first article) threshold
LOWER_THRESHOLD = 16
#
print("1. combine po number and item number")
for i, row in df.iterrows(): # iterate through each row with its index and contents
    a = str(row['Cust_PO_Number'])
    b = str(row['Item_Number'])

    concat = a + b

    df.at[i, 'Priority'] = concat # .set_value was deprecated; .at is the replacement
#worked 9.23
print('2. Count all the duplicates of the combined values above')
seen = {}
for i, row in df.iterrows(): # count the combined values; dict keys can't have duplicates
    c = row['Priority']

    if c not in seen: # have not seen this value before, so initialize it
        seen[c] = 0

    seen[c] += 1 # seen the concatenated value once more, add one

for i, row in df.iterrows(): # loop through each row again to look up each value's count
    c = row['Priority']

    times_seen = seen[c]

    df.at[i, 'Mfg_Co'] = times_seen # .set_value was deprecated; .at is the replacement
print("3. Flag rows where the combined-value count is one")
for i, row in df.iterrows():
      d = row['Mfg_Co']
      if d == 1.0:
          df.at[i, 'Sub_Priority'] = True
      else:
          df.at[i, 'Sub_Priority'] = False

print('4. Delete all rows where orders are cancelled but concated column is more than 1')
delete_these = []
for i, row in df.iterrows():
      f = row['WO_Stat']
      d = row['Sub_Priority']

      if str(f) == 'Cancelled' and d != True:
          delete_these.append(i)
          df_deleted = pd.concat([df_deleted, row.to_frame().T]) # .append is deprecated; concat the row as a one-row frame

df.drop(delete_these, axis=0, inplace=True)

#above this was working 9.24 but had not tested the data integrity , looked pretty good tho
over_numbers = {}
for i, row in df.iterrows(): # record which 'Priority' values have a qty over 99; still working out kinks 9.24
      c = row['Priority']
      g = row['Order_Qty']

      if float(g) > 99.0:
          over_numbers[c] = True
# a little confused about below
print('step 5')
for i, row in df.iterrows(): # flag rows whose 'Priority' group had a qty over 99
    c = row['Priority']

    if c in over_numbers:
        df.at[i, 'Comments_Status'] = True
    else:
        df.at[i, 'Comments_Status'] = False
#above, this was working fine 9.24 but not sure if it makes sense, also need to work on below
## 
delete_these = []

for i, row in df.iterrows(): # remove rows whose group has an over-99 qty AND whose own qty is under 16
    g = float(row['Order_Qty']) # compare the quantity itself, not 'Sub_Priority'
    f = row['Comments_Status']

    if g < LOWER_THRESHOLD and f: # under 16, in a group that also has an over-99 qty
        delete_these.append(i) # store row index to drop later
        df_deleted = pd.concat([df_deleted, row.to_frame().T]) # add the row to the other dataframe

df.drop(delete_these, axis=0, inplace=True)

#step 5 was not working as of 10.2, it was breaking out the first article data wrong

with pd.ExcelWriter('1start.xlsx', engine='xlsxwriter') as writer: # context manager saves on exit; writer.save() is deprecated
    df.to_excel(writer, sheet_name='Sheet1')

with pd.ExcelWriter('deleted1start.xlsx', engine='xlsxwriter') as writer:
    df_deleted.to_excel(writer, sheet_name='Sheet1')
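On the speed concern: here is a vectorized sketch of steps 1-5 that avoids iterrows entirely (column names taken from the code above; toy data stands in for the spreadsheet, so treat it as a sketch rather than a drop-in replacement):

```python
import pandas as pd

# Toy stand-in for the spreadsheet data
df = pd.DataFrame({
    'Cust_PO_Number': ['PO1', 'PO2', 'PO2', 'PO3', 'PO3'],
    'Item_Number':    ['I1',  'I2',  'I2',  'I3',  'I3'],
    'WO_Stat':        ['Cancelled', 'Cancelled', 'Active', 'Active', 'Active'],
    'Order_Qty':      [5, 10, 20, 12, 500],
})

# Step 1: concatenate the two key columns
df['Priority'] = df['Cust_PO_Number'].astype(str) + df['Item_Number'].astype(str)

# Step 2: per-row count of each combined value
df['Mfg_Co'] = df.groupby('Priority')['Priority'].transform('count')

# Steps 3-4: drop cancelled rows whose combined value appears more than once
df = df[~((df['WO_Stat'] == 'Cancelled') & (df['Mfg_Co'] > 1))]

# Step 5: drop rows under 16 only when the same group also has a qty over 99
qty = pd.to_numeric(df['Order_Qty'])
has_big = qty.groupby(df['Priority']).transform('max') > 99
df = df[~((qty < 16) & has_big)]
```

Each transform broadcasts a group aggregate back to every row, so the whole pipeline is a handful of column operations instead of four passes of 40k single-row updates.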

--- New format of question, with attempt to make easier to understand/help---

import pandas as pd

df = pd.DataFrame({'Column_A':['test1', 'test7', 'test7', 'test4', 'test6', 'test6', 'test7'],'Column_B':['WO1','WO7', 'WO7', 'WO6', 'WO6', 'WO6', 'WO7'],
                   'Column_A_B': ['','','','','','','',], 'Satus': ['Cancelled','Cancelled', 'Active', 'Active', 'Open', 'Active', 'Active'],
                   'Qty': ['12', '34' , '13', '3000', '14', '88', '1500']})

Please view the example DataFrame above and my step-by-step goals:

Step 1: Combine Column_A and Column_B into Column_A_B.
Step 2: Count each instance of the values in Column_A_B.
Step 3: Filter out rows where there is only 1 instance of the value in Column_A_B.
Step 4: Delete each row that has 'Cancelled' in the 'Satus' column, and just that row; there may be rows with the same value in Column_A_B but different 'Satus' values (note: the step 3 filter is still applied).
Step 5: With the filter still on for Column_A_B (i.e., filtering out counts of 1), look at the duplicated values (where the count in Column_A_B is 2 or greater) and, for each such group, look at the 'Qty' column. If the group has a Qty of less than 16 AND one over 99, delete just the row(s) whose Qty is under 16. If the group's Qty values are all less than 99, don't delete anything; if all the Qty values are over 99, don't delete anything either.

The resulting df of this logic will be:

import pandas as pd

goaldf = pd.DataFrame({'Column_A': ['test1', 'test4', 'test6', 'test6', 'test7'],
                       'Column_B': ['WO1', 'WO6', 'WO6', 'WO6', 'WO7'],
                       'Column_A_B': ['test1WO1', 'test4WO6', 'test6WO6', 'test6WO6', 'test7WO7'],
                       'Satus': ['Cancelled', 'Active', 'Open', 'Active', 'Active'],
                       'Qty': ['12', '3000', '14', '88', '1500']})
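For what it's worth, the five steps applied to the example DataFrame can be sketched without any row loops (this is an attempt at the stated logic on the sample data, not tested against the real spreadsheet):

```python
import pandas as pd

df = pd.DataFrame({'Column_A': ['test1', 'test7', 'test7', 'test4', 'test6', 'test6', 'test7'],
                   'Column_B': ['WO1', 'WO7', 'WO7', 'WO6', 'WO6', 'WO6', 'WO7'],
                   'Satus': ['Cancelled', 'Cancelled', 'Active', 'Active', 'Open', 'Active', 'Active'],
                   'Qty': ['12', '34', '13', '3000', '14', '88', '1500']})

# Step 1: combine the key columns
df['Column_A_B'] = df['Column_A'] + df['Column_B']

# Step 2: count each instance of the combined value
counts = df.groupby('Column_A_B')['Column_A_B'].transform('count')

# Steps 3-4: among duplicated combined values, drop the Cancelled rows
df = df[~((counts > 1) & (df['Satus'] == 'Cancelled'))]

# Step 5: within the surviving duplicated groups, drop Qty < 16 rows
# only when the same group also contains a Qty over 99
qty = pd.to_numeric(df['Qty'])
counts = df.groupby('Column_A_B')['Column_A_B'].transform('count')
has_big = qty.groupby(df['Column_A_B']).transform('max') > 99
df = df[~((counts > 1) & (qty < 16) & has_big)]
# df now matches goaldf's rows: test1, test4, test6, test6, test7
```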
0004
    Could you post some example data and the expected output that would result from that data? It's very difficult to offer much help without that. Also check out [how to ask proper pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Peter Leimbigler Feb 21 '19 at 00:31
    @PeterLeimbigler - Thank you so much for your input. I edited my comment and added below to what I think to be what you are looking for (example input data and example output data) , as well as what your example provided. I did not include the code as directly relates to the example df as I am not sure if my original code was correct, so I wrote out that logic. Please let me know if I can further help you help me. – 0004 Feb 21 '19 at 02:30
  • @PeterLeimbigler so sorry, I understand if you're tired of hearing from me, but I updated the test df and goal df. – 0004 Feb 22 '19 at 02:08
  • Sorry for leaving this unanswered! Thank you for the example data and output. But in general, I can't see a clean, compact way of translating these specific rules into pandas code. The answer by @johnnyb is about all the improvement I would be able to come up with myself... – Peter Leimbigler Feb 22 '19 at 03:35
  • @PeterLeimbigler I understand and thank you for the time; really, it is just step 5 I need help with (sifting out qty under 13 when the group also has a qty over 99). I have another post going: https://stackoverflow.com/questions/54819211/pandas-logic-room-for-dynamic-programming that has direct code on the sample data so you can easily run it on your machine (copy and paste in IDLE), but I understand if you don't have the time/don't understand fully. Thanks again for your feedback! – 0004 Feb 22 '19 at 03:50

1 Answer


I second @PeterLeimbigler's comments, but I would suggest a few overall things to help with your code. Use iterrows only if absolutely necessary; personally, I have found it to be much slower than the standard pandas methods. See below for a few changes I would make.

# To concatenate two columns into one as a string type
df["NewCol"] = df["Col1"].astype(str) + df["Col2"].astype(str) # assigns the concatenated values to the new column instead of iterating over each row, much faster this way

# To assign a count column, giving a per-row count of how many times that row's NewCol value appears in the whole dataframe
df['Counts'] = df.groupby(['NewCol'])['NewCol'].transform('count') # the count ignores NaN values

# If your intent is to just compare two rows to get a count duplicate based on both columns, keep your data as ints and do this 
df['Counts'] = df.groupby(['col1', 'col2'])['coltocount'].transform('count')

# Alternate method to count values
countcol1 = df['Col1'].value_counts() # value_counts is a method, so it must be called
counts = countcol1.to_dict() # converts to dict
df['Count'] = df['Col1'].map(counts)

# To get true false values based on a specific column's data 
df["Truethiness"] = (df["ColToCompare"] == 1.0)  # This can be multiple conditions if need be. 

# To conditionally drop rows from a pandas dataframe
df = df.drop(df[<some condition>].index)

# If you need to save the data from the conditional drop
df2 = df.drop(df[<Alternate condition of above>].index)
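Not from the answer itself, but one way these patterns might combine for the asker's step-5 rule (the 'Priority' and 'Order_Qty' columns here are stand-ins for the real data):

```python
import pandas as pd

df = pd.DataFrame({'Priority': ['A', 'A', 'B'], 'Order_Qty': [10, 500, 12]})

# Per-group count, and a flag for groups containing a quantity over 99
df['Counts'] = df.groupby('Priority')['Priority'].transform('count')
big = df.groupby('Priority')['Order_Qty'].transform('max') > 99

# Save the rows about to be dropped, then conditionally drop them
cond = (df['Order_Qty'] < 16) & big
df2 = df[cond]                # the "deleted" rows (here: the A/10 row)
df = df.drop(df[cond].index)  # A/500 and B/12 remain
```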
johnnyb