Trying to iterate over rows of dataframe but only accessing column headers

Question

I have a dataset of US House of Representatives races. While there are 435 districts, I have 439 Democratic candidates, which is too many, and I'm trying to figure out why. I suspect there are runoff races causing this, which I want to test.

>>> democrat_results.head()
    states  po  dist    cand                party       cand_votes  tot_votes
0   ALABAMA AL  1       ROBERT KENNEDY JR.  DEMOCRAT    89226       242617
4   ALABAMA AL  2       TABITHA ISNER       DEMOCRAT    86931       226230
7   ALABAMA AL  3       MALLORY HAGAN       DEMOCRAT    83996       231915
10  ALABAMA AL  4       LEE AUMAN           DEMOCRAT    46492       230969
13  ALABAMA AL  5       PETER JOFFRION      DEMOCRAT    101388      260673

What I'm trying to do is see whether any of the state districts (e.g. AL 1, AL 2) have two listings. I can figure out how to do this on my own, but the problem I'm having is that whenever I write a for loop to act on the dataframe, it seems to act only on the column headers.

unique_races = []

for row in democrat_results[1:]:
    if row not in unique_races:
        unique_races.append(row)
# this was "row" and "race", now has been changed to just be row in both cases

unique_races returns:

['states', 'po', 'dist', 'cand', 'party', 'cand_votes', 'tot_votes']

(I am aware that the loop won't do what I'm looking for, it's just to demonstrate what happens)

How do I avoid this and instead have the for loop act on the rows?

I am aware that for loops are inefficient, but I'm only using a few hundred values and do not know more advanced methods, making it suitable enough for me.

Rest assured that I have spent time looking for an answer, and not found one, leading to me asking this question.

Does this answer your question? [Pandas: Selecting rows based on value counts of a particular column](https://stackoverflow.com/questions/36166090/pandas-selecting-rows-based-on-value-counts-of-a-particular-column) — Michael Delgado, Feb 21 '21 at 20:11
Also, in a very different way, this may answer your question? Ballotpedia's [Top-two primary](https://ballotpedia.org/Top-two_primary) and [Louisiana majority vote system](https://ballotpedia.org/Louisiana_majority-vote_system) articles detail how in California and Washington State, voters cast ballots in an open primary, with the top two candidates, regardless of party, advancing to the general; in Louisiana, all contenders participate in the general, with runoff elections for the top two candidates held if any candidate fails to secure a majority. — Michael Delgado, Feb 21 '21 at 20:23
@MichaelDelgado I think I've figured out what I'm going to do:
1. Create a State-District variable concatenating the "po" and "dist" columns of each entry
2. Create a frequency table of State-District
3. Examine the rows with a State-District count > 1
Finding the runoffs is a short-term problem, but the long-term one I'm struggling with is why my for loop seems to get only the column names and not the data in the rows. Can you give any advice on how to access the rows? Thank you. — Demosthenes, Feb 21 '21 at 20:55
It's hard for me to debug your for loop because it doesn't actually have all the relevant code in it. For example, where is `race` defined? Please create a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) to get the best quality answers. Ideally, a question should include all of the code necessary to produce the issue you're facing. Thanks! :) — Michael Delgado, Feb 21 '21 at 21:03
Ahh, I'm sorry! I seem to be using "race" and "result" in different loops, and here I used both. Sometimes I'm even using "row". But for all intents and purposes they are the same, just loop variables, not variables set up beforehand. I've corrected the for loop in the question, though it still has the same problem. On another answer I've found the .iterrows() function, which seems to get what I need:

    cal = []
    for index, row in democrat_results.iterrows():
        #if row not in unique_races:
        #    unique_races.append(row)
        if str(row[1]) == "CA":
            cal.append(row)

— Demosthenes, Feb 21 '21 at 21:14
The code there iterates through all of them and gets the ones with CA in the "po" column, which shows how I need to do others. Thanks for the help, it's appreciated! — Demosthenes, Feb 21 '21 at 21:15

Michael Delgado · Answer 1 · 2021-02-23T06:32:49.390

Your suggestion to group by ("po", "dist") is on the right track. The issue is comparing the actual items you want from each row. When you loop through the DataFrame with df.iterrows(), you get the entire row on each step. You can see this by inspecting the first item returned by df.iterrows():

In [22]: next(df.iterrows())
Out[22]:
(0,
 states                   ALABAMA
 po                            AL
 dist                           1
 cand          ROBERT KENNEDY JR.
 party                   DEMOCRAT
 cand_votes                 89226
 tot_votes                 242617
 Name: 0, dtype: object)

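Along with the index, 0, we have the full row as a Series. Comparing whole rows is not useful here: every row in this dataframe is unique, so no row ever matches another. On top of that, testing a Series for membership in a list (if row not in unique_races) makes Python take the truth value of an element-wise comparison, which pandas rejects with a ValueError as soon as the list is non-empty. Instead, you want to check for the values of po and dist: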

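unique_races = []   # one representative row per race
seen_races = set()  # (po, dist) keys already encountered

for ix, row in df.iterrows():
    race = (row['po'], row['dist'])  # this produces a tuple, e.g. ('AL', 1)

    if race not in seen_races:
        seen_races.add(race)
        unique_races.append(row)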

This will give you a list with one row from each unique race. However, it seems like you're trying to identify the races with two or more candidates. Additionally, as you suggested, looping isn't the fastest way to do this.
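As for why the original loop only ever saw the headers: iterating over a DataFrame directly walks its column labels, and a row slice such as [1:] is still a DataFrame, so it behaves the same way. Here is a minimal sketch of the difference, assuming a made-up two-row frame (`toy` is hypothetical, not your data):

```python
import pandas as pd

# Hypothetical toy frame, not the asker's real data.
toy = pd.DataFrame({"po": ["AL", "AL"], "dist": [1, 2]})

# Iterating a DataFrame directly yields its column labels.
labels = [item for item in toy]

# A row slice is still a DataFrame, so the loop still yields column labels.
labels_after_slice = [item for item in toy[1:]]

# iterrows() yields (index, Series) pairs, one pair per row.
rows = [(row["po"], row["dist"]) for _, row in toy.iterrows()]

# itertuples() yields one namedtuple per row and is usually faster than iterrows().
tuples = [(t.po, t.dist) for t in toy.itertuples(index=False)]

print(labels)              # ['po', 'dist']
print(labels_after_slice)  # ['po', 'dist']
print(rows)
print(tuples)
```

For a few hundred rows either iterrows() or itertuples() is fine. The grouped approach is mainly worth it because it is shorter, not because the loop is too slow.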

The compound key ('po', 'dist') does make this a little tricky. To simplify things, we can add a unique dist_id column:

# convert dist to str so it can be added to the string PO code
df_with_distid = df.assign(dist_id=(df['po'] + df['dist'].astype(str)))

Now, we can count the occurrences of unique dist_ids, and find dist_ids with more than one candidate:

counts_by_distid = df_with_distid.groupby('dist_id').size()
more_than_1 = counts_by_distid[counts_by_distid > 1]

Finally, we can subset the full frame to those races with more than one Democratic candidate:

df_with_distid[df_with_distid.dist_id.isin(more_than_1.index)]
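If you would rather not add a helper column, one alternative is DataFrame.duplicated with a subset of key columns. This is only a sketch of that alternative, not the method used above, and the small frame here (including the JANE DOE row) is invented for illustration:

```python
import pandas as pd

# Hypothetical sample with the question's column names; JANE DOE is a made-up extra row.
df = pd.DataFrame(
    columns=["po", "dist", "cand"],
    data=[
        ["AL", 1, "ROBERT KENNEDY JR."],
        ["AL", 2, "TABITHA ISNER"],
        ["AL", 5, "PETER JOFFRION"],
        ["AL", 5, "JANE DOE"],
    ],
)

# keep=False flags every row whose (po, dist) pair occurs more than once,
# rather than flagging only the second and later occurrences.
is_repeated = df.duplicated(subset=["po", "dist"], keep=False)

# Boolean indexing keeps just the rows from races with several candidates.
repeated_races = df[is_repeated]
print(repeated_races)
```

Because the key columns are compared as they are, this avoids converting dist to a string first.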

To test this out, I modified your subset of the data to include an extra entry:

In [32]: df = pd.DataFrame(
    ...:     columns=['states', 'po', 'dist', 'cand', 'party', 'cand_votes', 'tot_votes'],
    ...:     data=[
    ...:         ['ALABAMA', 'AL', '1', 'ROBERT KENNEDY JR.', 'DEMOCRAT', '89226', '242617'],
    ...:         ['ALABAMA', 'AL', '2', 'TABITHA ISNER', 'DEMOCRAT', '86931', '226230'],
    ...:         ['ALABAMA', 'AL', '3', 'MALLORY HAGAN', 'DEMOCRAT', '83996', '231915'],
    ...:         ['ALABAMA', 'AL', '4', 'LEE AUMAN', 'DEMOCRAT', '46492', '230969'],
    ...:         ['ALABAMA', 'AL', '5', 'PETER JOFFRION', 'DEMOCRAT', '101388', '260673'],
    ...:         ['ALABAMA', 'AL', '5', 'JANE DOE', 'DEMOCRAT', '159285', '260673'],
    ...: ])

The above code results in the following:

In [33]: df_with_distid = df.assign(dist_id=(df['po'] + df['dist'].astype(str)))

In [34]: counts_by_distid = df_with_distid.groupby('dist_id').size()
    ...: more_than_1 = counts_by_distid[counts_by_distid > 1]

In [35]: df_with_distid[df_with_distid.dist_id.isin(more_than_1.index)]
Out[35]:
    states  po dist            cand     party cand_votes tot_votes dist_id
4  ALABAMA  AL    5  PETER JOFFRION  DEMOCRAT     101388    260673     AL5
5  ALABAMA  AL    5        JANE DOE  DEMOCRAT     159285    260673     AL5
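The count can also be taken on the two-column key directly, since groupby accepts a list of columns. A rough sketch, again on an invented sample (the JANE DOE row and the names `counts` and `crowded` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample with the question's column names; JANE DOE is a made-up extra row.
df = pd.DataFrame(
    columns=["po", "dist", "cand"],
    data=[
        ["AL", 4, "LEE AUMAN"],
        ["AL", 5, "PETER JOFFRION"],
        ["AL", 5, "JANE DOE"],
    ],
)

# Number of candidates per (po, dist) pair; the result is a Series
# whose index is made of (po, dist) tuples.
counts = df.groupby(["po", "dist"]).size()

# Keep only the races that list more than one candidate.
crowded = counts[counts > 1]
print(crowded)
```

This gives a compact list of the suspicious races with their candidate counts, which may be all you need to explain the four extra entries.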

n c · Answer 2 · 2021-02-21T23:53:28.917

code:

class Data:
    results = [ 
    "states  po  dist    cand                party       cand_votes  tot_votes",
    "0   ALABAMA AL  1       ROBERT KENNEDY JR.  DEMOCRAT    89226       242617",
    "4   ALABAMA AL  2       TABITHA ISNER       DEMOCRAT    86931       226230",
    "4   ALABAMA AL  2       TABITHA ISNER       DEMOCRAT    86931       226230",
    "4   ALABAMA AL  2       TABITHA ISNER       DEMOCRAT    86931       226230",
    "7   ALABAMA AL  3       MALLORY HAGAN       DEMOCRAT    83996       231915",
    "10  ALABAMA AL  4       LEE AUMAN           DEMOCRAT    46492       230969",
    "13  ALABAMA AL  5       PETER JOFFRION      DEMOCRAT    101388      260673",
    "10  ALABAMA AL  4       LEE AUMAN           DEMOCRAT    46492       230969",
    ]

done = []
def search_data(po, dist):
    results = []
    for line in Data.results[1:]:
        line = str(line)
        line = line.split(' ')
        line = ' '.join(x for x in line if x != '')
        line = line.split(' ')
        if po == line[2]:
            if dist == line[3]:
                line = ' '.join(x for x in line)
                results.append(line)

    try:
        d = results[1]
        if d not in done:
            done.append(d)
            print('\n==============================================')
            for data in results:
                print(data)
            print('==============================================')
    except IndexError:
        # print('result has only 1 result in it. No dupes.')
        pass


for line in Data.results[1:].copy():
    line = str(line)
    line = line.split(' ')
    line = ' '.join(x for x in line if x != '')
    line = line.split(' ')
    po = line[2]
    dist = line[3]
    search_data(po, dist)

output:

==============================================
4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230
4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230
4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230
==============================================

==============================================
10 ALABAMA AL 4 LEE AUMAN DEMOCRAT 46492 230969
10 ALABAMA AL 4 LEE AUMAN DEMOCRAT 46492 230969
==============================================

The code above loops through the dataset, reads the PO and DIST of each line, and prints the duplicates for each PO and DIST pair as a group. If a pair has no duplicates, nothing is printed for it.

So basically, running the code on your dataset should print all the duplicates you want in groups, with ================= and \n between them to make viewing easier.
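The same grouping can be done in a single pass with a dictionary keyed on (PO, DIST). This is only a sketch of that idea, reusing a few of the sample lines from above. The names `by_race` and `dupes` are made up here, and it assumes state names contain no spaces (a name like NEW YORK would shift the field positions):

```python
from collections import defaultdict

# A few of the sample lines from the code above (header line left out).
lines = [
    "0   ALABAMA AL  1       ROBERT KENNEDY JR.  DEMOCRAT    89226       242617",
    "4   ALABAMA AL  2       TABITHA ISNER       DEMOCRAT    86931       226230",
    "4   ALABAMA AL  2       TABITHA ISNER       DEMOCRAT    86931       226230",
    "7   ALABAMA AL  3       MALLORY HAGAN       DEMOCRAT    83996       231915",
]

# Collect the cleaned-up lines for every (po, dist) pair.
by_race = defaultdict(list)
for line in lines:
    fields = line.split()  # split() with no argument collapses runs of spaces
    po, dist = fields[2], fields[3]
    by_race[(po, dist)].append(" ".join(fields))

# Keep only the pairs that appear on more than one line.
dupes = {race: rows for race, rows in by_race.items() if len(rows) > 1}

for race, rows in dupes.items():
    print("==============================================")
    for row in rows:
        print(row)
```

Each line is visited once, so there is no need to rescan the whole list for every entry or to track which groups were already printed.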

:)
