-2

I have a baseball stats df wherein some of the columns names are:

  1. "Games" - which indicates the number of games a player has played in that year.
  2. "Year" - which contains values between 1990 and 2020
  3. "PlayerID" - not unique. Basically shows which player played for which "Team" in what "Year"
  4. "Home Runs" - Number of nome Runs hit by that player in that year.

I want to find out:

  1. who has the max number of home runs where the number of games played is greater than 100.
  2. which column/ feature has the highest correlation with "Home Run"

Data is in the form of a csv file.

table head

Tanmoy
  • 789
  • 7
  • 14

2 Answers2

0

Looks like you can use a groupby and transform methods aggregate some of the data. There is also a pandasql library that you could import and use SQL to query the dataframe. Link to similar question: Run sql query on pandas dataframe

0
  1. df = df[df['Games'] >= 100] will keep only lines with more than 100 games. Then df['Home Runs'].max() will give you the max number.

  2. Finding the maximum value (excluding Home Runs) in df.corr()['Home Runs'] (possibly with .idxmax()) should give you the most correlated column.

NB: You might have to do some aggregation first like df = df.groupby(['PlayerID']).sum() to get career stats and not year stats but you said PlayerID is not a unique identifier so you might have to investigate this a bit more in depth. And .sum() aggregation only makes sense for Games and Home Runs so be careful if you need the Year column in other computations.

Baobab
  • 155
  • 8
  • df = df[df['Games'] > 100] -> this would not help since we are trying to figure out how many players have played >=100 games in their career, not in that year. argmax() -> this gives a number. not sure how to use it to find column/ feature name. – Tanmoy May 25 '22 at 15:16
  • Hello @Tanmoy please read the end of my post, there is a Nota Bene explaining aggregation if you need career stats. You can replace argmax with idxmax to get the index instead of the position (the number you get with argmax). However, you might need to exclude ‘Home Runs’ first as the correlation with itself will be maximum (ie equal to 1). Hope this helps? – Baobab May 27 '22 at 06:54
  • @Tanmoy is there something still not working? Let me know if I can help! Otherwise, can you upvote and accept my answer please? – Baobab May 28 '22 at 19:13
  • can you pls give me the list of queries that I need to execute to get the desired answer? I'm still stuck. – Tanmoy Jun 03 '22 at 07:10