1

I am working on a project that will perform an audit of employees with computer accounts. I want to print one data frame with the two new columns in it. This is different from the Comparing Columns in Dataframes question because I am working with strings. I will also need to do some fuzzy logic but that is further down the line.

The data I receive is in Excel sheets. It comes from two sources that I don't have control over and so I format them to be [First Name, Last Name] and print them to the console to ensure the data I am working with is correct. I convert the .xls to .csv files, format the information and am able to output the two lists of names in a single dataframe with two columns but have not been able to put the values I want in the last two columns. I have used query (which returned True/False, not the names), diff and regex. I assume that I am just using the tools incorrectly.

    import pandas as pd

    nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary 
          Emerson","Amelia H. Hayden","Abraham Oliver"],
          'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ 
          McMahon","Amelia H. Hayden"]}
    info = pd.DataFrame(data=nd)

    for row in info:
    if info.col1.value not in info.col2:
        info["Need Account"] = info.col1.value

    if info.col2.value not in info.col1:
        info["Delete Account"] = info.col2.value

    print(info)

What I would like is a new dataframe with 2 columns: Need Account and Delete Account and fill in the appropriate values based on the other columns in the dataframe. In this case, I am getting an error that 'Series' has not attribute 'value'. Here is an example of my expected output:

    df_out: 
    Need Account       Delete Account
    Demetrius McMahon  Abe Oliver
    Abraham Oliver     Hillary Emerson
    Hilary Emerson     DJ McMahon

From this list I can look to see who's nickname showed up and pare the list down from there.

IrishJohn
  • 13
  • 3
  • 3
    what is your expected output from the sample data? – Quang Hoang Jul 03 '19 at 19:24
  • Possible duplicate of [Comparing two columns in pandas dataframe to create a third one](https://stackoverflow.com/questions/41440068/comparing-two-columns-in-pandas-dataframe-to-create-a-third-one) – G. Anderson Jul 03 '19 at 19:34
  • I don't think `for row in info` is doing what you think it is. Try changing the contents of the loop to `for row in infor: print(row)` to check the output. – G. Anderson Jul 03 '19 at 19:35
  • The expected output is the addition of two more columns, Need Account and Delete Account. I will update the question with the expected output – IrishJohn Jul 03 '19 at 19:36

3 Answers3

0

I'm taking a chance without seeing your expected output, but reading what you are attempting in your code. Let me know if this is what you are looking for?

nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
      'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"], 
      'Need Account':"", 
      'Delete Account':""
     }
info = pd.DataFrame(data=nd)

print(info)

               col1              col2 Need Account Delete Account
0     Abraham Hansen    Abraham Hansen                            
1  Demetrius McMahon        Abe Oliver                            
2     Hilary Emerson   Hillary Emerson                            
3   Amelia H. Hayden        DJ McMahon                            
4     Abraham Oliver  Amelia H. Hayden    

Don't use loops, use vectors...

info.loc[info['col1'] != info['col2'], 'Need Account'] = info['col1']
info.loc[info['col2'] != info['col1'], 'Delete Account'] = info['col2']

print(info)

               col1              col2       Need Account    Delete Account
0     Abraham Hansen    Abraham Hansen                                     
1  Demetrius McMahon        Abe Oliver  Demetrius McMahon        Abe Oliver
2     Hilary Emerson   Hillary Emerson     Hilary Emerson   Hillary Emerson
3   Amelia H. Hayden        DJ McMahon   Amelia H. Hayden        DJ McMahon
4     Abraham Oliver  Amelia H. Hayden     Abraham Oliver  Amelia H. Hayden
run-out
  • 3,114
  • 1
  • 9
  • 25
  • That's very close! I changed the question as I thought it might be too difficult to just put the two other columns in. Thank you! I am going to work on a method of iterating through each list to compare them (in the example Amelia would be excluded from the Need/Delete account columns). Thank you again! – IrishJohn Jul 03 '19 at 19:48
  • If you must loop, you can use [iterrows](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html), but pandas strength really is in vector solutions and I'm sure your problem can be solved in this way. – run-out Jul 03 '19 at 19:52
  • I will focus on Vectors then. I am teaching myself Python via finding projects at work. I really appreciate your help. How do I mark your answer as being the one that answered my question? – IrishJohn Jul 03 '19 at 19:54
  • It will show up on the left after enough time lapses. My biggest learning in pandas was to stop thinking in terms of traditional loops and start thinking in terms of vectors. You will find that this opens up a whole new way of thinking. Last year I made a large program using pandas but stuck in loops. It was aweful. Once the light went on understanding vectors, my whole world changed. Keep at the vector angle, it will come! – run-out Jul 03 '19 at 19:58
0

You want to use isin and np.where to conditionally assign the new values:

info['Need Account'] = np.where(~info['col1'].isin(info['col2']), info['col1'], np.NaN)
info['Delete Account'] = np.where(~info['col2'].isin(info['col1']), info['col2'], np.NaN)

                col1              col2       Need Account   Delete Account
0     Abraham Hansen    Abraham Hansen                NaN              NaN
1  Demetrius McMahon        Abe Oliver  Demetrius McMahon       Abe Oliver
2     Hilary Emerson   Hillary Emerson     Hilary Emerson  Hillary Emerson
3   Amelia H. Hayden        DJ McMahon                NaN       DJ McMahon
4     Abraham Oliver  Amelia H. Hayden     Abraham Oliver              NaN

Or if you want a new dataframe like you stated in your question:

need = np.where(~info['col1'].isin(info['col2']), info['col1'], np.NaN)
delete = np.where(~info['col2'].isin(info['col1']), info['col2'], np.NaN)

newdf = pd.DataFrame({'Need Account':need,
                      'Delete Account':delete})

        Need Account   Delete Account
0                NaN              NaN
1  Demetrius McMahon       Abe Oliver
2     Hilary Emerson  Hillary Emerson
3                NaN       DJ McMahon
4     Abraham Oliver              NaN
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • I tried using np.where in earlier versions of my project, I couldn't get it to work like this. Thank you for taking the time to help me. – IrishJohn Jul 03 '19 at 19:59
  • Sure no problem, so did that answer your question? – Erfan Jul 03 '19 at 20:16
  • It did answer my question. I am going to do some more looking into how NumPy implements "where" so that I have a better understanding of what is happening in your answer. There is always more to learn. – IrishJohn Jul 05 '19 at 13:18
0

IIUC, it doesn't seem like there is much 'structure' to be maintained from your input dataframe, so you could use sets to compare membership in groups directly.

nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
      'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"]}
df = pd.DataFrame(data=nd)

col1 = set(df['col1'])
col2 = set(df['col2'])

need = col1 - col2
delete = col2 - col1

print('need = ', need)
print('delete =  ', delete)

yields

need =  {'Hilary Emerson', 'Demetrius McMahon', 'Abraham Oliver'}
delete =   {'Hillary Emerson', 'DJ McMahon', 'Abe Oliver'}

You could then place in a new dataframe:

data = {'need':list(need), 'delete':list(delete)}
new_df = pd.DataFrame.from_dict(data, orient='index').transpose()

(Edited to account for possibility that need and delete are of unequal length.)

Brendan
  • 3,901
  • 15
  • 23
  • You are correct in that I don't need to maintain the structure. It is more about having the information reduced from the current excel file of 600+ names to the ones that don't match. Thank you for taking some time to help me. – IrishJohn Jul 03 '19 at 20:08