Looking to merge/concatenate/groupby different rows in Pandas dataframe

Question

I am iterating through a large list of dataframes of baseball statistics for different players. The data is indexed by year. I want to group by year, keeping the salary the same and adding up WAR. I also want to drop rows that are not single years. In my data set the Year entries are strings.

My attempt to group:

for x in clean_stats_list:
    x.groupby("Year")

My attempt to eliminate rows:

for x in clean_stats_list:
    for i in x['Year']:
        if len(i) > 4:
            x['Year'][i].drop()

      WAR         Year       Salary
 0    1.4         2008     $390,000
 1    0.9         2009     $418,000
 2    2.4         2010     $445,000
 3    3.6         2011   $3,400,000
 4    5.2         2012   $5,400,000
 5    1.3         2013   $7,400,000
 6    6.8         2014  $10,000,000
 7    3.8         2015  $10,000,000
 9    0.2         2015  $10,000,000
 11   5.5         2016  $15,833,333
 12   2.0         2017  $21,833,333
 13   1.3         2018  $21,833,333
 14  34.3   11 Seasons  $96,952,999
 16  25.4  CIN (8 yrs)  $37,453,000
 17   8.8  SFG (3 yrs)  $59,499,999

This is what I am expecting to achieve:

     WAR         Year       Salary
 0    1.4         2008     $390,000
 1    0.9         2009     $418,000
 2    2.4         2010     $445,000
 3    3.6         2011   $3,400,000
 4    5.2         2012   $5,400,000
 5    1.3         2013   $7,400,000
 6    6.8         2014  $10,000,000
 7    4.0         2015  $10,000,000
 11   5.5         2016  $15,833,333
 12   2.0         2017  $21,833,333
 13   1.3         2018  $21,833,333

Provide a snippet of your dataframe & expected output in proper format (make use of HTML snippet). That dataframe the way it is now is not readable. — moys, Aug 14 '19 at 04:04

score 1 · Answer 1 · answered Aug 14 '19 at 04:18

To filter on the length of the Year column, try creating a boolean mask and selecting with it.

Code:

mask_df = your_df['Year'].str.len() == 4
your_df_cleaned = your_df.loc[mask_df]
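As an illustrative sketch of the mask approach (not part of the original answer), the snippet below builds a small hypothetical frame from a few rows of the question's data. The names `df`, `mask` and `cleaned` are made up for the example, and the `astype(str)` call is an added guard that assumes some Year values might not already be strings.

```python
import pandas as pd

# Hypothetical sample taken from a few rows of the question's table.
df = pd.DataFrame({
    "WAR": [3.8, 0.2, 34.3, 25.4],
    "Year": ["2015", "2015", "11 Seasons", "CIN (8 yrs)"],
    "Salary": ["$10,000,000", "$10,000,000", "$96,952,999", "$37,453,000"],
})

# True where the Year text is exactly 4 characters long.
# astype(str) is an assumption-driven guard for non-string entries.
mask = df["Year"].astype(str).str.len() == 4

# Keep only the rows where the mask is True.
cleaned = df.loc[mask]
print(cleaned)
```

Note that a length check alone would also keep any other 4-character value, such as a hypothetical "N/A " entry.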

Nidhin Bose J.


score 0 · Accepted Answer · answered Aug 14 '19 at 05:27

You can use a regex to validate the years with Series.str.contains and boolean indexing. Unlike a length check, this does not keep values that are 4 characters long but are not years:

#https://stackoverflow.com/a/4374209
#keep only years between 1000 and 2999
df1 = df[df['Year'].str.contains(r'^[12][0-9]{3}$')]
#alternative: accept any four digits, 0000-9999
#df1 = df[df['Year'].str.contains(r'^\d{4}$')]
print (df1)
    WAR  Year       Salary
0   1.4  2008     $390,000
1   0.9  2009     $418,000
2   2.4  2010     $445,000
3   3.6  2011   $3,400,000
4   5.2  2012   $5,400,000
5   1.3  2013   $7,400,000
6   6.8  2014  $10,000,000
7   3.8  2015  $10,000,000
9   0.2  2015  $10,000,000
11  5.5  2016  $15,833,333
12  2.0  2017  $21,833,333
13  1.3  2018  $21,833,333
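The output above still has two rows for 2015, while the question asks for them to be combined. One possible way to do that, sketched here under assumptions and not taken from the answer, is to group the filtered rows by Year, sum WAR, and keep the first Salary in each group. The sample frame and the names `years_only` and `combined` are hypothetical, and taking the first Salary assumes the salary is identical within a year, as it is in the question's data.

```python
import pandas as pd

# Hypothetical sample taken from a few rows of the question's table.
df = pd.DataFrame({
    "WAR": [6.8, 3.8, 0.2, 5.5, 34.3],
    "Year": ["2014", "2015", "2015", "2016", "11 Seasons"],
    "Salary": ["$10,000,000", "$10,000,000", "$10,000,000",
               "$15,833,333", "$96,952,999"],
})

# Step 1: keep only rows whose Year is a four-digit year (1000-2999).
years_only = df[df["Year"].str.contains(r"^[12][0-9]{3}$")]

# Step 2: one row per Year, adding up WAR and keeping the first Salary.
# Assumes Salary does not differ between rows of the same year.
combined = (
    years_only.groupby("Year", as_index=False)
    .agg(WAR=("WAR", "sum"), Salary=("Salary", "first"))
)

# Round away floating-point noise and restore the original column order.
combined["WAR"] = combined["WAR"].round(1)
combined = combined[["WAR", "Year", "Salary"]]
print(combined)
```

Grouping produces a fresh 0-based index, so the original row labels (7, 9, 11, ...) are not preserved. For a list of frames such as `clean_stats_list`, the same two steps could be applied to each element inside a list comprehension.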
