I want to create a third column in my pandas DataFrame based on columns 1 and 2. As far as I know, the two columns match whenever both are filled in, but sometimes one of them is blank. If I build the third column from id1 alone, it ends up blank whenever id1 is blank. I want the third column to use whichever of the two isn't blank to create the college name.

Original:

    id1     id2            
0   ID01   ID01             
1          ID03            
2   ID07                   
3   ID08   ID08            

Desired:

    id1     id2            college_name
0   ID01   ID01             College1
1          ID03             College3
2   ID07                    College7
3   ID08   ID08             College8

Also, one thing about this data frame is that I am pretty sure the first 2 columns either are an exact match or one of them is empty. I would like to double-check if there is an instance where id1 and id2 are completely different numbers in the same row. How should I do that?

youtube
  • Does this answer your question? [Coalesce values from 2 columns into a single column in a pandas dataframe](https://stackoverflow.com/questions/38152389/coalesce-values-from-2-columns-into-a-single-column-in-a-pandas-dataframe) – Nick ODell Jul 19 '22 at 22:35
  • The question I linked doesn't mention this, but to use `combine_first()`, you'll need to replace the blank entries with np.nan. – Nick ODell Jul 19 '22 at 22:36

1 Answer

Backfill across the columns so that id1 takes the value from id2 wherever id1 is missing. Extract the digits, then convert to int (which drops the leading zero) and back to str.

Given:

    id1   id2
0  ID01  ID01
1   NaN  ID03
2  ID07   NaN
3  ID08  ID08

Doing:

df['college_name'] = 'College' + (df.bfill(axis=1)['id1']
                                    .str.extract(r'(\d+)', expand=False)
                                    .astype(int)
                                    .astype(str))

Output:

    id1   id2 college_name
0  ID01  ID01     College1
1   NaN  ID03     College3
2  ID07   NaN     College7
3  ID08  ID08     College8
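
For a frame that starts with empty strings rather than NaN (as in the question), the same idea can be sketched end to end with `combine_first()`, the approach suggested in the comments. This is a minimal sketch, not the only way to do it: the sample data is rebuilt by hand, and the intermediate names `combined` and `digits` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question; blanks are assumed to be empty strings.
df = pd.DataFrame({
    "id1": ["ID01", "", "ID07", "ID08"],
    "id2": ["ID01", "ID03", "", "ID08"],
})

# Turn empty strings into NaN so pandas treats them as missing.
df = df.replace("", np.nan)

# Coalesce: keep id1 where present, otherwise fall back to id2.
combined = df["id1"].combine_first(df["id2"])

# Pull out the digits, drop leading zeros via int, and prefix with "College".
digits = combined.str.extract(r"(\d+)", expand=False)
df["college_name"] = "College" + digits.astype(int).astype(str)

print(df)
```

Note that `astype(int)` will raise if a row has both ids missing, so that case would need handling separately.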

To check for rows where the ids are different:

Given:

    id1   id2
0  ID01  ID01
1   NaN  ID03
2  ID07   NaN
3  ID08  ID98

Doing:

print(df[df.id1.ne(df.id2) & df.id1.notna() & df.id2.notna()])

Output:

    id1   id2
3  ID08  ID98
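
The same check can be written as a self-contained sketch that starts from empty strings, assuming the blanks in the real data are `''` and not whitespace. The names `both_present` and `mismatched` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame with one deliberately conflicting row (index 3).
df = pd.DataFrame({
    "id1": ["ID01", "", "ID07", "ID08"],
    "id2": ["ID01", "ID03", "", "ID98"],
}).replace("", np.nan)

# Only rows where both ids are filled in can genuinely conflict.
both_present = df["id1"].notna() & df["id2"].notna()

# Keep rows where both ids exist but differ.
mismatched = df[both_present & df["id1"].ne(df["id2"])]

print(mismatched)
print("conflicting rows:", len(mismatched))
```

An empty `mismatched` frame would confirm that every row either matches or has one side blank.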
BeRT2me
  • Hi, I just edited my question to be more clear since you didn't really answer my question. – youtube Jul 20 '22 at 05:58
  • So I tried this, but it isn't working for me- I start off with no NA values, just empty values. Could this be the issue? When I try to add the na values, it doesn't work. – youtube Jul 20 '22 at 06:23
  • @youtube `df = df.replace('', np.nan)` – BeRT2me Jul 20 '22 at 06:24