1

I have a pandas dataframe like below:

   MSuite  TCase   KWord  
 0 MS1     Nan     Nan     
 1 NaN     T1      NaN     
 2 NaN     NaN     K1  
 3 NaN     NaN     K4 
 4 NaN     NaN     K8 
 5 NaN     NaN     V3
 6 NaN     T2      NaN 
 7 NaN     NaN     K7 
 8 NaN     NaN     K12 
 9 NaN     NaN     V10 
10 MS2     NaN     NaN 
11 NaN     T3      NaN 
12 NaN     NaN     K22
13 NaN     NaN     K15    
14 NaN     NaN     V3 
15 NaN     T4      NaN 
16 NaN     NaN     K10 
17 NaN     NaN     K4 
18 NaN     NaN     K12
19 NaN     NaN     K2
20 NaN     NaN     V6   

And I want to split it into the following matricies:

Matrix 1 - K Matrix

[[1, 4, 8, 7, 12, 22, 15, 10, 4, 12, 2],  
 [1, 1, 1, 2,  2,  3,  3,  4, 4,  4, 4],
 [1, 1, 1, 1,  1,  2,  2,  2, 2,  2, 2]]

For the top row

Each K# is stored in a separate dataframe/array that I want to get the index of for each K in chronological order.

For the middle row

I want to denote under which TCase each K is stored in, i.e. T1 is consisting of K1, K4, K8...

Bottom Row

Similar to middle row, but with the MSuite denoting.

Matrix 2 - V Matrix

 [[3, 10, 3, 6],
 [4,  3, 3, 5], 
 [1,  1, 2, 2]]

Top Row

Similar to top row in Matrix 1, where I have a separate file with a list of all the various V# available, and I would like to just grab the index of the file V# that is the same as the one in the dataframe.

Middle Row

The index of the V# within each group of T#, i.e in T1, V3 is the 4th KWord, and for T2, V10 is the 3rd KWord.

Bottom Row

Denoting which MSuite the V# is falling under. i.e V3 in T1 as well as V10 in T2 are both falling under MS1 -> 1 while V3 in T3 and V6 in T4 are both falling under MS2 -> 2.

Attempt So Far

I began by looping through the dataframe, but came across this word of advice. So now I am wondering if I need to somehow implement vectorization, but I really do not know where to start with that.

If not vectorization, I would continue to loop through the data frame, while caching at each MSuite and TCase increase, and appending the cached values.

Adam Kit
  • 47
  • 10

1 Answers1

2

You can try of using string contain along with string extract to achieve that

def Process(df,Char):
    df = df.loc[df.KWord.fillna('').str.contains(Char)]
    return df.apply(lambda x: x.str.extract('(\d+)')[0].rename(x.name)).astype(int).T.values[::-1]
df = pd.read_clipboard()
df[['MSuite','TCase']] = df[['MSuite','TCase']].ffill()
Process(df,'K')

Out:

array([[ 1,  4,  8,  7, 12, 22, 15, 10,  4, 12,  2],
       [ 1,  1,  1,  2,  2,  3,  3,  4,  4,  4,  4],
       [ 1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2]])
Naga kiran
  • 4,528
  • 1
  • 17
  • 31
  • Well that looks nifty. I am kind of lost with the first line in the process function, and where do you find the patterns for the extract method? Thank you for the answer, as it seems correct, but It would be awesome if you could comment the code a little bit, to let me know what exactly you are doing. Thanks. – Adam Kit Mar 15 '20 at 04:45
  • 1
    Oh wait no you edited it.. ignore the comment ^^ thanks. – Adam Kit Mar 15 '20 at 04:47
  • 1
    Very nice +1 , i had a similar method with the only change as: `df = df[df.KWord.str.contains(Char,na=False)] : return df.stack().str.extract('(\d+)',expand=False).astype(int).unstack().to_numpy().T[::-1]` :) – anky Mar 15 '20 at 04:52
  • 1
    Thanks @anky_91 , almost similar approach :) – Naga kiran Mar 15 '20 at 04:54
  • I like the idea guys, but I want to point out that K1 and K2 are not simply to just have their letter removed and then the number kept. They correspond to items in a different dataframe that I would like to allude to i.e there is a dataframe with all the K's, and I would like to then grab the index from that dataframe, and use that value in the final matrix. – Adam Kit Mar 16 '20 at 03:07