Problem Summary: Given a DataFrame pDF
that contains basic data on the names, instruments, and birth dates and death dates of jazz musicians, I want to create a column that either calculates the lifespan of the musicians who have already died, or returns 'N/A' for musicians who have not died or for whom there is no data on.
Explanation: I'm doing a webscraping and data analysis project where I scrape Wikipedia for basic information (names, instruments, birth dates and death dates) on jazz musicians in order to create a database and do some data analysis. I've managed to build a script that works for the list of pianists so far, and as a test I've built a DataFrame pDF
with the first 30 musicians on the "List of jazz musicians" Wikipedia page. Here is the current state of pDF
:
Artist Instrument Birthdate Deathdate Birth Year \
0 Irving Aaronson Piano 1895 1963-03-10 1895
1 Anders Aarum Piano 1974-12-17 Present 1974
2 Mike Abene Piano 1942 Present 1942
3 Don Abney Piano 1923 2000-01-20 1923
4 Chris Abrahams Piano 1961 Present 1961
5 Muhal Richard Abrams Piano 1930-09-19 2017-10-29 1930
6 John Adriano Acea Piano 1917 1963 1917
7 Beegie Adair Piano 1937-12-11 Present 1937
8 Kei Akagi Piano 1953-03-16 Present 1953
9 Toshiko Akiyoshi Piano 1929-12-12 Present 1929
10 Erling Aksdal Jr. Piano 1953-02-18 Present 1953
11 Joe Albany Piano 1924-01-24 1988-01-12 1924
12 Tony Aless Piano 1921-08-28 1988-01-11 1921
13 Charlie Alexander Piano 1890 1970 1890
14 Joey Alexander Piano 2003-06-25 Present 2003
15 Monty Alexander Piano 1944-06-06 Present 1944
16 Sinan Alimanović Piano 1954-02-11 Present 1954
17 Esther Allan Piano 1914 1985 1914
18 Geri Allen Piano 1957 2017 1957
19 Steve Allen Piano 1921 2000 1921
20 Mose Allison Piano 1927-11-11 2016-11-15 1927
21 Mikhail Alperin Piano 1956-11-07 2018-05-11 1956
22 Helio Alves Piano 1966 Present 1966
23 Jimmy Amadie Piano 1937 2013 1937
24 Albert Ammons Piano 1907-03-01 1949-12-02 1907
25 Franck Amsallem Piano 1961-10-25 Present 1961
26 Chris Anderson Piano 1926-02-26 2008-02-04 1926
27 Bill Anschell Piano N/A N/A N/A
28 Jim Aton Double bass 1925 2008 1925
29 Ivar Antonsen Piano 1946-04-16 Present 1946
Death Year Lifespan
0 1963 NaN
1 Present NaN
2 Present NaN
3 2000 NaN
4 Present NaN
5 2017 NaN
6 1963 NaN
7 Present NaN
8 Present NaN
9 Present NaN
10 Present NaN
11 1988 NaN
12 1988 NaN
13 1970 NaN
14 Present NaN
15 Present NaN
16 Present NaN
17 1985 NaN
18 2017 NaN
19 2000 NaN
20 2016 NaN
21 2018 NaN
22 Present NaN
23 2013 NaN
24 1949 NaN
25 Present NaN
26 2008 NaN
27 N/A NaN
28 2008 NaN
29 Present NaN
All of the information in the table except for the NaN
s are strings.
(I understand that Birthdate
and Birth Year
and Deathdate
and Death Year
are keeping essentially the same information... this is part of my project I am going to work out later, but I haven't decided what I want to do yet. Additionally, an 'N/A' in the DataFrame means that the information could not be found by the web-scraping script)
Using the year dates that are in the Birth Year
and Death Year
columns, I want to fill in the values for the Lifespan
column so that:
- If the
Death Year
column contains a (four digit) number, theLifespan
column for that row is filled with the basic calculationDeath Year - Birth Year
. - If the
Death Year
column does not contain a (four digit) number, theLifespan
column for that row is filled with 'N/A' (orNaN
to be consistent with DataFrame syntax - something I also have to change in my script later)
Here is the solution that I came up with:
for index, row in pDF.iterrows():
if (row['Birth Year'] != 'N/A') & (row['Death Year'] != 'Present'):
row['Birth Year'] = int(row['Birth Year'])
row['Death Year'] = int(row['Death Year'])
row['Lifespan'] = row['Death Year'] - row['Birth Year']
My solution is different from the logic I described above because I wasn't exactly sure how to implement the logic.
Here is my main question:
- Is the logic that I described in the bullet points the most efficient and pythonic way I could go about accomplishing what I'm trying to do? If not, what's the better way? I was also trying to find a way to re-type all of the four digit numbers in the
Birth Year
andDeath Year
columns as integers, but I couldn't find a good-looking way to do that either.
Thank you!