1

I am working on a dataset from the Federal Elections Commission on US House of Representative campaign spending in the 2018 Midterms. Because it contains all candidates (including those who lost primaries) and I am only interested in general election Democratic candidates, I'm trying to reduce the dataset to nominated Democrats. I've assumed that these will be the Democrats with the highest spending in each district.

The starting dataset is here: https://www.fec.gov/data/candidates/house/?election_year=2018&election_full=True&party=DEM&is_active_candidate=true

I have created a new column, state_dist, which is a combination of state and district with the latter turned to int, then str. This was then sorted by state and district to the following:

enter image description here

I also have a stacked version, stacking by state and district:

enter image description here

I have tried borrowing code from Bharath's answer here. It creates the following:

enter image description here

The problem is that max_in_district appears to return the right values, but I need the candidate name and other variables as well, to attach this to my main project dataframe. I tried adding name in max_in_district_2, but this doesn't return the result I'm looking for.

To be clear, I want the full row in each "state_dist" value which has the highest value in the "disbursements" column. How should I get this?

Thank you in advance.

Demosthenes
  • 111
  • 1
  • 9

2 Answers2

1

Find the idxmax and then merge the data back to the original frame:

df.set_index('name').groupby(['state_dist'], as_index=False)['disbursements'].idxmax()\
.merge(df, right_on=['name', 'state_dist'], 
       left_on=['disbursements', 'state_dist'])
It_is_Chris
  • 13,504
  • 2
  • 23
  • 41
1

You can do:

df.loc[
    df.groupby("state_dist", as_index=False)["disbursements"].idxmax()[
        "disbursements"
    ]
].to_csv("out.csv")

This creates out.csv:

enter image description here

Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91