0

Problem Summary: Given a DataFrame pDF that contains basic data on the names, instruments, and birth dates and death dates of jazz musicians, I want to create a column that either calculates the lifespan of the musicians who have already died, or returns 'N/A' for musicians who have not died or for whom there is no data on.

Explanation: I'm doing a webscraping and data analysis project where I scrape Wikipedia for basic information (names, instruments, birth dates and death dates) on jazz musicians in order to create a database and do some data analysis. I've managed to build a script that works for the list of pianists so far, and as a test I've built a DataFrame pDF with the first 30 musicians on the "List of jazz musicians" Wikipedia page. Here is the current state of pDF:

                  Artist   Instrument   Birthdate   Deathdate Birth Year  \
0        Irving Aaronson        Piano        1895  1963-03-10       1895   
1           Anders Aarum        Piano  1974-12-17     Present       1974   
2             Mike Abene        Piano        1942     Present       1942   
3              Don Abney        Piano        1923  2000-01-20       1923   
4         Chris Abrahams        Piano        1961     Present       1961   
5   Muhal Richard Abrams        Piano  1930-09-19  2017-10-29       1930   
6      John Adriano Acea        Piano        1917        1963       1917   
7           Beegie Adair        Piano  1937-12-11     Present       1937   
8              Kei Akagi        Piano  1953-03-16     Present       1953   
9       Toshiko Akiyoshi        Piano  1929-12-12     Present       1929   
10     Erling Aksdal Jr.        Piano  1953-02-18     Present       1953   
11            Joe Albany        Piano  1924-01-24  1988-01-12       1924   
12            Tony Aless        Piano  1921-08-28  1988-01-11       1921   
13     Charlie Alexander        Piano        1890        1970       1890   
14        Joey Alexander        Piano  2003-06-25     Present       2003   
15       Monty Alexander        Piano  1944-06-06     Present       1944   
16      Sinan Alimanović        Piano  1954-02-11     Present       1954   
17          Esther Allan        Piano        1914        1985       1914   
18            Geri Allen        Piano        1957        2017       1957   
19           Steve Allen        Piano        1921        2000       1921   
20          Mose Allison        Piano  1927-11-11  2016-11-15       1927   
21       Mikhail Alperin        Piano  1956-11-07  2018-05-11       1956   
22           Helio Alves        Piano        1966     Present       1966   
23          Jimmy Amadie        Piano        1937        2013       1937   
24         Albert Ammons        Piano  1907-03-01  1949-12-02       1907   
25       Franck Amsallem        Piano  1961-10-25     Present       1961   
26        Chris Anderson        Piano  1926-02-26  2008-02-04       1926   
27         Bill Anschell        Piano         N/A         N/A        N/A   
28              Jim Aton  Double bass        1925        2008       1925   
29         Ivar Antonsen        Piano  1946-04-16     Present       1946   

   Death Year Lifespan  
0        1963      NaN  
1     Present      NaN  
2     Present      NaN  
3        2000      NaN  
4     Present      NaN  
5        2017      NaN  
6        1963      NaN  
7     Present      NaN  
8     Present      NaN  
9     Present      NaN  
10    Present      NaN  
11       1988      NaN  
12       1988      NaN  
13       1970      NaN  
14    Present      NaN  
15    Present      NaN  
16    Present      NaN  
17       1985      NaN  
18       2017      NaN  
19       2000      NaN  
20       2016      NaN  
21       2018      NaN  
22    Present      NaN  
23       2013      NaN  
24       1949      NaN  
25    Present      NaN  
26       2008      NaN  
27        N/A      NaN  
28       2008      NaN  
29    Present      NaN

All of the information in the table except for the NaNs are strings.

(I understand that Birthdate and Birth Year and Deathdate and Death Year are keeping essentially the same information... this is part of my project I am going to work out later, but I haven't decided what I want to do yet. Additionally, an 'N/A' in the DataFrame means that the information could not be found by the web-scraping script)

Using the year dates that are in the Birth Year and Death Year columns, I want to fill in the values for the Lifespan column so that:

  • If the Death Year column contains a (four digit) number, the Lifespan column for that row is filled with the basic calculation Death Year - Birth Year.
  • If the Death Year column does not contain a (four digit) number, the Lifespan column for that row is filled with 'N/A' (or NaN to be consistent with DataFrame syntax - something I also have to change in my script later)

Here is the solution that I came up with:

for index, row in pDF.iterrows():
    if (row['Birth Year'] != 'N/A') & (row['Death Year'] != 'Present'):
        row['Birth Year'] = int(row['Birth Year'])
        row['Death Year'] = int(row['Death Year'])
        row['Lifespan'] = row['Death Year'] - row['Birth Year']

My solution is different from the logic I described above because I wasn't exactly sure how to implement the logic.

Here is my main question:

  • Is the logic that I described in the bullet points the most efficient and pythonic way I could go about accomplishing what I'm trying to do? If not, what's the better way? I was also trying to find a way to re-type all of the four digit numbers in the Birth Year and Death Year columns as integers, but I couldn't find a good-looking way to do that either.

Thank you!

1 Answers1

0

I would keep the Present labels to the side for re-integration later. For the actual calculation, I would force everything into date-times:

df['bday_dt'] = pd.to_datetime(df['Birthdate'], errors='coerce')
df['dday_dt'] = pd.to_datetime(df['Deathdate'], errors='coerce')

df['age_yrs'] = (df['bday_dt'] - df['dday_dt']).astype('timedelta64[Y]').astype(float)

Then use a column, perhaps alive, constructed from your Deathdate column to say whether the musician is alive or not. From there, if you want, you could overwrite the age column with np.where with the alive data.

ifly6
  • 5,003
  • 2
  • 24
  • 47
  • Why would you recommend doing this over my solution? I'm not attached to my solution at all, I'm just not experienced enough with Python to know why your solution is the more efficient / effective one. – Nat Porter Sep 18 '20 at 19:55
  • Acting over columns is faster than acting over rows due to vectorisation. https://stackoverflow.com/questions/47755442/what-is-vectorization – ifly6 Sep 18 '20 at 21:22