Here's what I would do instead of using nested for loops. This table was a bit tricky, as bizarrely it had nested tables within it. I also don't really like BeautifulSoup for grabbing table data and only use it as a last resort.
The problem you're having is that within each row there are multiple nested elements, so it takes 21 items to actually collect all the data for one row of the table. I also think you'd be better off using the csv module that Python provides to make your CSV files. Essentially, when writing CSV we work in rows, writing each row one at a time with a for loop.
Code Example
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.transfermarkt.com/transfers/transferrekorde/statistik?saison_id=&land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&w_s=&plus=1'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
response = requests.get(url,headers=headers)
response.status_code #want this to be 200
response.content
soup = BeautifulSoup(response.content, 'html.parser')
player_table = soup.find_all('table', class_='items')
len(player_table)
player_table = player_table[0]
rows = []
for row in player_table.select('tbody > tr'):
    num = row.select('td')[0].get_text()
    name = row.select('td')[3].get_text().replace('\n', '')
    position = row.select('td')[4].get_text()
    age = row.select('td')[5].get_text()
    market_value = row.select('td')[6].get_text()
    season = '20' + row.select('td > a')[1].get_text().strip()
    left_club = row.select('td > a')[3].get_text()
    left_league = row.select('td > a')[4].get_text()
    joined_club = row.select('td > a')[6].get_text()
    joined_league = row.select('td > a')[7].get_text()
    fee = row.select('td > a')[8].get_text()
    row = [num, name, position, age, market_value, season, left_club, left_league, joined_club, joined_league, fee]
    rows.append(row)

with open('soccer_stats.csv', 'w', newline='') as r:
    file = csv.writer(r)
    columns = ['Number', 'Name', 'Position', 'Age', 'Market Value', 'Season Left', 'Club Left', 'League Left', 'Club Joined', 'League Joined', 'Transfer Fee']
    file.writerow(columns)
    for row in rows:
        file.writerow(row)
Explanation
I've slightly tidied the code up. Instead of using find_all, I tend to find the select method gives me more specificity: it takes a CSS selector, generates a list, and we can specify within that list which piece of data we want. Ultimately we want to put each row of data into a list that we can then loop over.
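For illustration, here's the same table grabbed both ways (a minimal sketch that assumes `soup` already holds the parsed page from the code above):

```python
# find_all filters by tag name and keyword arguments
first_table = soup.find_all('table', class_='items')[0]

# select takes a CSS selector and also returns a list, so combinators like '>'
# and attribute filters can be written directly in the selector string
same_table = soup.select('table.items')[0]
cells = same_table.select('tbody > tr > td')
```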
So we're grabbing all the relevant row data, but because the table is poorly set up we're having to do this using specific list indices. It's a laborious task, which is why I prefer alternative means of grabbing table data. I'm then adding each row list to another list, because the csv module accepts rows of data: I want to loop over each row and write it to the CSV file one at a time.
To write the CSV file, we open it in write mode, but we also specify `newline=''`. Without this, the line endings the csv module writes can be translated again by the file object (most noticeably on Windows), which puts a blank row after every row written to the CSV, so we have to tell it NOT to do this.
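As a minimal sketch of that (the filename here is just a placeholder):

```python
import csv

# without newline='', the '\r\n' row endings the csv module writes can be
# translated again by the file object (most visibly on Windows), leaving a
# blank line after every row; newline='' hands line-ending control to csv
with open('example.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['a', 'b', 'c'])
```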
We use the csv.writer method to enable us to write data to the CSV file we have chosen.
We then specify the columns we want and write them as the header row using the writer.writerow method, which lets us write our lists as rows to the CSV file.
We then take our rows list, that is, the nested lists of row data, loop over it, and write each row using the writerow method.
Potential Alternative
I mentioned I only use BeautifulSoup as a last resort. The pandas package can read HTML and grab table data. Unfortunately it wasn't working for this website, but as a pure example of how much more concise it could be, you would do something like the below.
import requests
import pandas as pd
url = 'https://www.transfermarkt.com/transfers/transferrekorde/statistik?saison_id=&land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&w_s=&plus=1'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
response = requests.get(url, headers=headers)
# note: read_html fetches the URL itself, so the custom headers above are not
# actually sent with this call; it returns a list of DataFrames, one per table found
df = pd.read_html(url)
Using the pandas read_html method grabs table data easily. It's not always foolproof, but it's my go-to for grabbing table data from HTML.
See here for more details on this method.
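As a rough sketch of what it returns, using a tiny made-up HTML string rather than a live URL (read_html also accepts raw HTML or a file-like object, and needs lxml or html5lib installed):

```python
import pandas as pd
from io import StringIO

# a tiny made-up table purely to show the return value
html = """
<table>
  <tr><th>Name</th><th>Fee</th></tr>
  <tr><td>Player A</td><td>€100m</td></tr>
</table>
"""
tables = pd.read_html(StringIO(html))  # a list of DataFrames, one per <table> found
print(tables[0])
```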
Additional Resources
CSV Module Python Docs - For looking up the individual methods to get a feel for it.
Tutorial on CSV with Python
Updated Code as per comments
To code this for all results, it's just a matter of putting things into separate functions. I don't think the code is big enough to really worry about classes, so I haven't gone down that route. The key is to try to make each function do one thing.
import requests
from bs4 import BeautifulSoup
import csv
def get_page(url):
    # fetch a single page and return the parsed soup
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def get_table(soup):
    # pull every row of the transfers table out into a list of lists
    rows = []
    player_table = soup.find_all('table', class_='items')
    player_table = player_table[0]
    for row in player_table.select('tbody > tr'):
        num = row.select('td')[0].get_text()
        name = row.select('td')[3].get_text().replace('\n', '')
        position = row.select('td')[4].get_text()
        age = row.select('td')[5].get_text()
        market_value = row.select('td')[6].get_text()
        season = '20' + row.select('td > a')[1].get_text().strip()
        left_club = row.select('td > a')[3].get_text()
        left_league = row.select('td > a')[4].get_text()
        joined_club = row.select('td > a')[6].get_text()
        joined_league = row.select('td > a')[7].get_text()
        fee = row.select('td[class$="hauptlink"]')[-1].get_text()
        row = [num, name, position, age, market_value, season, left_club, left_league, joined_club, joined_league, fee]
        rows.append(row)
    return rows

def write_csv(rows, i):
    # append mode, because this gets called once per page
    with open('soccer_stats.csv', 'a', newline='') as r:
        file = csv.writer(r)
        if i == 1:
            columns = ['Number', 'Name', 'Position', 'Age', 'Market Value', 'Season Left', 'Club Left', 'League Left', 'Club Joined', 'League Joined', 'Transfer Fee']
            file.writerow(columns)
            for row in rows:
                file.writerow(row)
        else:
            for row in rows:
                file.writerow(row)

if __name__ == "__main__":
    for i in range(1, 11):
        print('Page', i)
        url = f'https://www.transfermarkt.com/transfers/transferrekorde/statistik?ajax=yw2&altersklasse=&ausrichtung=&land_id=0&leihe=&page={i}&plus=1&saison_id=&spielerposition_id=&w_s='
        soup = get_page(url)
        rows = get_table(soup)
        write_csv(rows, i)
Updated Explanation
A couple of changes here. The `get_page` function just fetches the page and accepts the url variable; we will be looping over the pages and feeding each url to this function later.
The `get_table` function does exactly what it did before, but for some reason on the second page it wasn't picking up the fees, which may be what the comment was about. Instead I used the CSS selector `'td[class$="hauptlink"]'`, which selects only td tags whose class name ends with hauptlink. There are several of these per row, but the last one is the cell that holds the fee value, hence `row.select('td[class$="hauptlink"]')[-1].get_text()`.
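To illustrate what that selector matches, here's a minimal sketch using made-up HTML (the real cells contain more markup than this):

```python
from bs4 import BeautifulSoup

html = '<table><tr><td class="zentriert">1</td><td class="rechts hauptlink">€222.00m</td></tr></table>'
row = BeautifulSoup(html, 'html.parser')

# [class$="hauptlink"] matches td tags whose class attribute ends with "hauptlink";
# [-1] then picks the last such cell, which is the one holding the fee in the real table
print(row.select('td[class$="hauptlink"]')[-1].get_text())  # €222.00m
```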
For the `write_csv` function, it's very much the same, except that because we're going to be looping round this multiple times, writing the column headers on every pass would not give us the correct format for the CSV file. So we test for it: if `i`, the number we're looping with, is 1, it writes the columns and then the rows; otherwise it just writes the rows. There's probably a neater solution, as I'm repeating code, but it works.
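For what it's worth, one way to avoid repeating the row loop is to guard only the header line; this is just a sketch, the behaviour is the same:

```python
import csv

def write_csv(rows, i):
    with open('soccer_stats.csv', 'a', newline='') as r:
        file = csv.writer(r)
        if i == 1:  # only the header depends on which page we're on
            file.writerow(['Number', 'Name', 'Position', 'Age', 'Market Value', 'Season Left',
                           'Club Left', 'League Left', 'Club Joined', 'League Joined', 'Transfer Fee'])
        file.writerows(rows)  # writerows writes every row list in one call
```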
Now we get to the bottom, where we're actually looping.
Firstly, it's good practice to make sure functions are called within `if __name__ == "__main__":`.
We create a for loop, and we know that the only thing that changes from page to page is the page number. So using an f-string we can substitute that value into the url: every iteration, `i` goes up by one, giving us the desired url to scrape. Then we call the functions needed to scrape each page.
Updated Resources
For more information on `if __name__ == "__main__":`, see here.
For more information on f-strings, see here.