2

Data set

What I am trying to do is insert records into a dataset whenever a line is missing.

If you look at the data set above, it contains 3 columns of attributes and then 2 numeric values. The third column TTF, is incremental and should not skip any values. In this example it is missing 2 rows which are shown at the bottom. So what I want my code to do would be insert those 2 rows into the result set (i.e. Computer - Display is missing TTF of 5, and Television - Power Supply is missing TTF of 6. I would set the repair value to 0, and the running total value to the same as the previous row).

I was thinking I would approach it by splitting the column names and recursively walking through the first 2, and then 1 to 8 for the third.

for i in range(len(Product)):
    for j in range(len(Module)):
        for k in range(1, 8):  
            # Check if the Repair value is there if not make it 0
            # If Repair value is missing, look up previous Running Total

Does this seem like the best approach? Any help with the actual code to accomplish this would really be appreciated.

EDIT: Here is code reading in the DF, since that seems to be confusing based on the excel screenshot.

>>> import pandas as pd
>>> 
>>> df = pd.read_csv('minimal.csv')
>>> 
>>> df
       Product         Module   TTF   Repair   Running Total
0     Computer        Display     1        3               3
1     Computer        Display     2        2               5
2     Computer        Display     3        1               6
3     Computer        Display     4        5              11
4     Computer        Display     6        4              15
5     Computer        Display     7        3              18
6     Computer        Display     8        2              20
7   Television   Power Supply     1        7               7
8   Television   Power Supply     2        6              13
9   Television   Power Supply     3        4              17
10  Television   Power Supply     4        5              22
11  Television   Power Supply     5        6              28
12  Television   Power Supply     7        7              35
13  Television   Power Supply     8        8              43
  • What you've posted though is a screenshot of what looks like an Excel file. You should really show how you've read this data into Python/pandas and how you intend to proceed from there. Have you loaded the single excel file into a single DataFrame? – roganjosh Jan 04 '18 at 19:34
  • I created the excel screenshot since I thought it would display best in the question. In actuality I am pulling in a result set from Teradata and that creates the Pandas df. – Michael Melillo Jan 04 '18 at 19:48
  • Really, it would be easier if you provided a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). Also see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). It's tough to understand how you can progress from the Excel SS - is it a single DF? In which case, you could just simply sort the data. – roganjosh Jan 04 '18 at 19:53
  • I have made an edit to the original question that shows code that gets me to where I am. – Michael Melillo Jan 04 '18 at 20:04

1 Answers1

3

Let's use reindex to create new TTF for missing number in sequence with np.arange:

df = pd.DataFrame({'Product':['Computer']*7 + ['Television']*7,'Module':['Display']*7 + ['Power Supply']*7,
                 'TTF':[1,2,3,4,6,7,8,1,2,3,4,5,7,8],'Repair':np.random.randint(1,8,14)})

df['Running Total'] = df['Repair'].cumsum()

print(df)

Input Dataframe:

          Module     Product  Repair  TTF  Running Total
0        Display    Computer       6    1              6
1        Display    Computer       2    2              8
2        Display    Computer       2    3             10
3        Display    Computer       4    4             14
4        Display    Computer       2    6             16
5        Display    Computer       3    7             19
6        Display    Computer       6    8             25
7   Power Supply  Television       3    1             28
8   Power Supply  Television       3    2             31
9   Power Supply  Television       5    3             36
10  Power Supply  Television       6    4             42
11  Power Supply  Television       4    5             46
12  Power Supply  Television       2    7             48
13  Power Supply  Television       2    8             50


df_out = df.set_index('TTF').groupby(['Product','Module'], group_keys=False).apply(lambda x: x.reindex(np.arange(1,9)))

df_out['repair'] = df_out['Repair'].fillna(0)

df_out = df_out.ffill().reset_index()

print(df_out)

Output:

    TTF        Module     Product  Repair  Running Total  repair
0     1       Display    Computer     6.0            6.0     6.0
1     2       Display    Computer     2.0            8.0     2.0
2     3       Display    Computer     2.0           10.0     2.0
3     4       Display    Computer     4.0           14.0     4.0
4     5       Display    Computer     4.0           14.0     0.0
5     6       Display    Computer     2.0           16.0     2.0
6     7       Display    Computer     3.0           19.0     3.0
7     8       Display    Computer     6.0           25.0     6.0
8     1  Power Supply  Television     3.0           28.0     3.0
9     2  Power Supply  Television     3.0           31.0     3.0
10    3  Power Supply  Television     5.0           36.0     5.0
11    4  Power Supply  Television     6.0           42.0     6.0
12    5  Power Supply  Television     4.0           46.0     4.0
13    6  Power Supply  Television     4.0           46.0     0.0
14    7  Power Supply  Television     2.0           48.0     2.0
15    8  Power Supply  Television     2.0           50.0     2.0
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • Thanks! This definitely does exactly what I'm looking for. Any suggestions on how to change the lambda to be dynamic rather than 1,9? I.e. the data could have different ranges of TTF over time. – Michael Melillo Jan 04 '18 at 20:09
  • @MichaelMelillo You can use df.TFF.min() and df.TFF.max() inside the np.arange function.`df_out = df.set_index('TTF').groupby(['Product','Module'], group_keys=False).apply(lambda x: x.reindex(np.arange(df.TTF.min(),df.TTF.max()+1)))` – Scott Boston Jan 05 '18 at 02:28