In the following code there are two DataFrames, recent_grads and all_ages, that have the same column labels:

majors = recent_grads['Major'].unique()
rg_lower_count = 0
for m in majors:
    recent_grads_row = recent_grads[recent_grads['Major'] == m]
    all_ages_row = all_ages[all_ages['Major'] == m]

    rg_unemp_rate = recent_grads_row.iloc[0]['Unemployment_rate']
    aa_unemp_rate = all_ages_row.iloc[0]['Unemployment_rate']

    if rg_unemp_rate < aa_unemp_rate:
        rg_lower_count += 1

print(rg_lower_count)

Why do I need the iloc[0] part (on lines 7 and 8)? Since each selection (recent_grads_row and all_ages_row) contains only one row, I would not expect to have to specify which rows to compare. Yet without it I get this error message:

ValueError: Can only compare identically-labeled Series objects
Martin Ueding
Moran Reznik
  • I think there is a better vectorized approach... Can you post small reproducible data sets and your desired data set? Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your question accordingly. – MaxU - stand with Ukraine Aug 14 '17 at 13:44

1 Answer

iloc[0] always returns the first row of the data frame by position, whatever its index label is. Selecting a column name from that row then gives you a single scalar value for each data frame, and scalars compare without trouble. If you instead compare two data frames (or, as in this case, two series obtained from their columns), the comparison only works if both have exactly the same index labels.
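As a rough illustration of the difference, here is a minimal sketch. The two small tables are made up for the example (the real recent_grads and all_ages data is not shown in the question); they are arranged so that the same major sits under a different index label in each.

```python
import pandas as pd

# Hypothetical toy data standing in for the real tables.
recent_grads = pd.DataFrame(
    {"Major": ["ART", "MATH"], "Unemployment_rate": [0.08, 0.04]}
)
all_ages = pd.DataFrame(
    {"Major": ["MATH", "ART"], "Unemployment_rate": [0.05, 0.07]}
)

rg_row = recent_grads[recent_grads["Major"] == "MATH"]  # index label 1
aa_row = all_ages[all_ages["Major"] == "MATH"]          # index label 0

# Comparing the one-element Series directly fails: their labels differ.
try:
    rg_row["Unemployment_rate"] < aa_row["Unemployment_rate"]
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Taking the first row by position gives plain scalars, which compare fine.
rg_rate = rg_row.iloc[0]["Unemployment_rate"]
aa_rate = aa_row.iloc[0]["Unemployment_rate"]
is_lower = rg_rate < aa_rate
print(is_lower)
```

The boolean filter keeps each row's original index label, which is why two one-row results can still be "differently labelled".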

To see what I mean, print recent_grads_row.index[0] and all_ages_row.index[0]; you should see different values. Another option would be to use reset_index on both data frames or something similar, but just picking the first row seems simpler here.
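For completeness, a sketch of the reset_index route, followed by one possible vectorized version of the whole loop along the lines the comment hints at. The data is invented for the example, and the vectorized part assumes each major appears exactly once in each table, which the question does not confirm.

```python
import pandas as pd

# Hypothetical toy data standing in for the real tables.
recent_grads = pd.DataFrame(
    {"Major": ["ART", "MATH", "BIO"], "Unemployment_rate": [0.08, 0.04, 0.06]}
)
all_ages = pd.DataFrame(
    {"Major": ["MATH", "BIO", "ART"], "Unemployment_rate": [0.05, 0.06, 0.07]}
)

# Option 1: reset_index relabels both one-row selections as 0,
# so the Series comparison no longer raises.
rg_row = recent_grads[recent_grads["Major"] == "MATH"].reset_index(drop=True)
aa_row = all_ages[all_ages["Major"] == "MATH"].reset_index(drop=True)
math_is_lower = (rg_row["Unemployment_rate"] < aa_row["Unemployment_rate"]).iloc[0]

# Option 2: index both tables by Major, align them, and compare in one step.
# Assumes Major values are unique within each table.
rg_rates = recent_grads.set_index("Major")["Unemployment_rate"]
aa_rates = all_ages.set_index("Major")["Unemployment_rate"].reindex(rg_rates.index)
rg_lower_count = int((rg_rates < aa_rates).sum())
print(rg_lower_count)
```

With the labels made identical first, pandas can compare the two series element by element, so no explicit loop over majors is needed.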

jdehesa