1

Trying to learn ways to iterate or cycle thru a column in pandas. In vba this is a for-loop and then selecting offset from selected cell position, just one option. However, I'm here to learn pandas and having a hard time understanding how to keep the rows straight when comparing the next column adjacency to the right or two columns over. Another way to maybe say this. Once ttype column text is found in other dataframe mtype column, I would like to compare the adjacent values in both dataframes with each other.

I have attached the dataframes for testing. I'm not sure if a for loop is the best way to achieve this but I have started one. I'm reading that pandas is more efficient handling the entire column at one time. Not sure if that can be done here. My first 3 lines of code (2 for-loops and if statement) are working. It cycles thru the text and finds the match. But I'm grappling with the handling of the adjacency values. I have reading up on the iloc and loc statements because I feel they grab the row. But I'm not sure of the syntax. I'm not even sure I can ask the right question to get me where I need so I can learn. So any reading material that you can help guide me to about this would be appreciated. pandas loc vs. iloc vs. ix vs. at vs. iat? get column value based on another column with list of strings in pandas dataframe

What is needed: With the toc dataframe I would like to cycle thru each value in ttype column, if value exist in the moc dataframe mtype column, then compare toc[ta column value] < moc[ma column value], if true, then continue, if false then toc[outfilter] == '1'.

import pandas as pd
from pandas import DataFrame, Series
import numpy as np

toc = {'ttype':['ta1k', 'brek', 'sjfgd',
           'gru2d','brek','crhe','ta1k','jump4'],
       'ta':[1, 2, 9, 9, 2, 2, 1, 1],
       'tc':[0, 1, 0, 0, 1, 0, 2, 0],
       'outfilter':[0, 0, 0, 0,0, 0, 0, 0]}

toc = pd.DataFrame(toc)

moc = {'mtype':[ 'sjfgd','ta1k','gru2d',
            'brek','crhe','jump4'],
       'mo':[2, 2, 4, 4, 3, 4],
       'ma':[2, 2, 4, 4, 2, 3],
       'mc':[1, 1, 3, 3, 1, 1]}

moc = pd.DataFrame(moc)

#-----
for tval in toc['ttype']:                          # Gets toc['ttype'].value
    for mval in moc['mtype']:                      # Gets toc['mtype'].value
        if t == m:                                 # compares if tval == mval
            if toc.loc['ta'] < moc.loc['ma']:      # compares toc.[ta] column value < moc.[ma]
                continue
            else:
                toc.loc['outfilter'] = '1'         # if the above is greater place '1' in outfilter 
                                                   #  column
        else:
            continue
#-----
 print(toc)
 print(moc)


 What I would like to do:  The '1's located in the outfilter column are a result of the toc-df[ta 
 column value] being greater than moc-df[ma column value]. 

 toc-df  ttype   ta   tc   outfilter
     0   ta1k    1    0    0
     1   brek    2    1    0
     2   sjfgd   9    0    1
     3   gru2d   9    0    1
     4   brek    2    1    0
     5   crhe    2    0    0
     6   ta1k    1    2    0
     7   jump4   1    0    0

I really appreciate your guys helping out and one day I hope that I can return the favor and pay it forward. Thank You for your time.!!! Please let me know if if you have any questions.

Boomer
  • 229
  • 1
  • 9

1 Answers1

1
  1. I would merge the dataframes on the ttype and mtype columns similar to do an index match/vlookup in Excel, but you don't want to merge the entire moc dataframe, so just specify and merge in the columns you need (mtype and ma).
  2. From there, simply do an np.where to see if ta values are greater than ma values and return 1 or 0 similar to an Excel if formula.
  3. Finally, drop the columns you don't need

input:

import pandas as pd, numpy as np
toc = {'ttype':['ta1k', 'brek', 'sjfgd',
       'gru2d','brek','crhe','ta1k','jump4'],
   'ta':[1, 2, 9, 9, 2, 2, 1, 1],
   'tc':[0, 1, 0, 0, 1, 0, 2, 0],
   'outfilter':[0, 0, 0, 0,0, 0, 0, 0]}

toc = pd.DataFrame(toc)

moc = {'mtype':[ 'sjfgd','ta1k','gru2d',
        'brek','crhe','jump4'],
   'mo':[2, 2, 4, 4, 3, 4],
   'ma':[2, 2, 4, 4, 2, 3],
   'mc':[1, 1, 3, 3, 1, 1]}

moc = pd.DataFrame(moc)

code:

toc = pd.merge(toc,moc[['mtype','ma']],how='left',left_on='ttype',right_on='mtype')
toc['outfilter'] = np.where((toc['ta'] > toc['ma']),1,0)
toc = toc.drop(['mtype','ma'], axis=1)
toc

breakdown of code line-by-line:

step 1 (similar to an excel index-match formula):

pd.merge(toc,moc[['mtype','ma']],how='left',left_on='ttype',right_on='mtype')

   ttype  ta  tc  outfilter  mtype  ma
0   ta1k   1   0          0   ta1k   2
1   brek   2   1          0   brek   4
2  sjfgd   9   0          0  sjfgd   2
3  gru2d   9   0          0  gru2d   4
4   brek   2   1          0   brek   4
5   crhe   2   0          0   crhe   2
6   ta1k   1   2          0   ta1k   2
7  jump4   1   0          0  jump4   3

step 2 (similar to an excel IF formula):

toc['outfilter'] = np.where((toc['ta'] > toc['ma']),1,0)

    ttype  ta  tc  outfilter  mtype  ma
0   ta1k   1   0          0   ta1k   2
1   brek   2   1          0   brek   4
2  sjfgd   9   0          1  sjfgd   2
3  gru2d   9   0          1  gru2d   4
4   brek   2   1          0   brek   4
5   crhe   2   0          0   crhe   2
6   ta1k   1   2          0   ta1k   2
7  jump4   1   0          0  jump4   3

step 3 - final output (just dropping unneeded columns):

toc = toc.drop(['mtype','ma'], axis=1)

   ttype  ta  tc  outfilter
0   ta1k   1   0          0
1   brek   2   1          0
2  sjfgd   9   0          1
3  gru2d   9   0          1
4   brek   2   1          0
5   crhe   2   0          0
6   ta1k   1   2          0
7  jump4   1   0          0

If I thought a little bit more, there is probably an even simpler way to do this in python with just one line of cod using pandas methods, but this way is simple enough and easy to understand.

Also, VBA was the language I switched to from Pandas about 18 months ago as well. I would say that 99% of problems can be solved with pandas methods, list comprehension or .apply(lambda x:.... Pandas methods or numpy methods are always the way to go in terms of simplicity, speed, performance, etc. It is very popular to loop in VBA, but you should try to get away from that as quickly as you can and learn the various pandas methods.

David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • David, I am just impressed with this. Simple and it cuts thru 5k records like butter. For a moment in testing I was iffy on the merge, how, left_on, right_on statements. but they seem to work. I'm not sure how currently. Could you explain a bit or guide me over to some reading material you like? I kinda thought it was a merge placement or position in moving the column into the TOC df but now I'm thinking it is actually the text lookup that I mentioned first has to happen in order to guide off the correct numbers in each of the rows. – Boomer Jul 26 '20 at 03:27
  • I have even used your snippet for the sum of the two columns greater than another column. I haven't really too much of numpy so more practicing needed. Again. A huge thanks for your response!! I really appreciate it..!! – Boomer Jul 26 '20 at 03:27
  • “left_on” and “right_on” are required because the names of the columns are different. If they were both the same name, then you could remove “left_on” and “right_on” and simply use “on”. Also, assuming the same number of columns, pandas could easily do this on 20 million+ rows, and the big data library that is built off pandas called “dash” could easily do something like this on billions of rows. When you use vectorized methods with pandas or dask, it’s like pressing channel 44 one time on your remote control versus pressing the channel up button from channel 1 44 times. – David Erickson Jul 26 '20 at 06:04
  • There is a wealth of knowledge on the internet as python is one of the most popular programming languages in the world and pandas is tagged for 10% of all python questions, so if you google the right question, you can probably get answers to any question if you research and absorb what you are reading – David Erickson Jul 26 '20 at 06:06
  • Hey David, great information. I have much reading to do. But I have a more operational question for you about np.where maybe. I hope I can ask this correctly. How does the code iterate/cycle/or test if true with toc.ttype with the already merged toc.mtype text column.? My reason to ask is, the ttype value must be found in mtype column in order to compare correctly the toc.ta > toc.ma values to the right. I can understand the toc.ta > toc.ma compare line. But I'm not getting how the code knows the correct value to compare with. I'm cant have just toc.ta compared with toc.ma. np.where maybe? – Boomer Jul 26 '20 at 13:35
  • Sorry for asking this a different way & possible confusion. I just really want to understand how your code is working and comparing the values. – Boomer Jul 26 '20 at 13:36
  • @Boomer I've updated the answer to show the intermediary outputs for running each step. It should be clear how it works from that. – David Erickson Jul 27 '20 at 20:30
  • A huge help David. I appreciate your time with this. It was the np.where which I didn't feel comfortable with. I knew it worked, and that's great but wanted to understand it better. Sometimes details get lost when trying to communicate the questions correctly and really doesn't come across at all. Sorry for any confusion. Thanks Again.!!! – Boomer Jul 27 '20 at 20:51