
All I want in my table webscrape is "Player, Age, Market Value, Nationality, Left, Joined, and Fee", but when I write it out as a .csv it does not come out as a table. Here is my code:

    import requests
    from bs4 import BeautifulSoup
    url = 'https://www.transfermarkt.com/transfers/transferrekorde/statistik?saison_id=&land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&w_s=&plus=1'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
    response = requests.get(url,headers=headers)
    response.status_code  #want this to be 200
    response.content
    soup = BeautifulSoup(response.content, 'html.parser')
    player_table = soup.find_all('table', class_='items')
    len(player_table)
    player_table = player_table[0]
    for row in player_table.find_all('tr'):
        for cell in row.find_all('th'):
            print(cell.text)
    for row in player_table.find_all('tr'):
        for cell in row.find_all('td'):
            (cell.text)
    with open('soccer_stats.csv', 'w') as r:
        for row in player_table.find_all('tr'):
            for cell in row.find_all('td'):
                r.write(cell.text.ljust(20))
            r.write('\n')

Does anyone know how to make this into a readable table in the .csv? Currently the whole .csv looks like this: [screenshot of the malformed CSV output]

1 Answer

Here's what I would do instead of using nested for loops. This table was a bit tricky, as it bizarrely has nested tables within it. I also don't really like BeautifulSoup for grabbing table data and only use it as a last resort.

The problem you're having is that each row contains multiple nested elements, so it takes 21 cells to collect all the data for one row of the table. I also think you'd be better off using the csv module that Python provides to make your CSV files. Essentially, when writing CSV we work in rows, writing each row out with a for loop.
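You can see this for yourself with a quick count (a diagnostic snippet, assuming soup and player_table have been built as in the code below):

first_row = player_table.select('tbody > tr')[0]
print(len(first_row.find_all('td')))  # counts every <td> in the row, nested table cells included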

Code Example

import requests
from bs4 import BeautifulSoup
import csv

url = 'https://www.transfermarkt.com/transfers/transferrekorde/statistik?saison_id=&land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&w_s=&plus=1'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
response = requests.get(url, headers=headers)
# response.status_code should be 200 if the request succeeded
soup = BeautifulSoup(response.content, 'html.parser')
player_table = soup.find_all('table', class_='items')[0]  # the page's main stats table

rows = []
for row in player_table.select('tbody > tr'):
    # simple fields come straight from the row's <td> cells by position
    num = row.select('td')[0].get_text()
    name = row.select('td')[3].get_text().replace('\n','')
    position = row.select('td')[4].get_text()
    age = row.select('td')[5].get_text()
    market_value = row.select('td')[6].get_text()
    # the remaining fields live inside links, so index the <a> tags under the <td>s instead
    season = '20' + row.select('td > a')[1].get_text().strip()
    left_club = row.select('td > a')[3].get_text()
    left_league = row.select('td > a')[4].get_text()
    joined_club = row.select('td > a')[6].get_text()
    joined_league = row.select('td > a')[7].get_text()
    fee = row.select('td > a')[8].get_text()
    # append the finished row without reusing the loop variable's name
    rows.append([num,name,position,age,market_value,season,left_club,left_league,joined_club,joined_league,fee])
with open('soccer_stats.csv','w', newline='') as r:
    file = csv.writer(r)
    columns = ['Number','Name','Position','Age','Market Value','Season Left','Club Left','League Left','Club Joined','League Joined','Transfer Fee']
    file.writerow(columns)
    
    for row in rows:
        file.writerow(row)

Explanation

I've slightly tidied the code up. Instead of using find_all, I find the select method gives me more specificity: it takes a CSS selector, returns a list, and we can index into that list for the piece of data we want. Ultimately we want to put each row of data into a list that we can then loop over.
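For example, inside the row loop both of these return lists, but the CSS selector lets you be precise about which elements you get back:

cells = row.find_all('td')    # every <td> under the row, however deeply nested
links = row.select('td > a')  # only <a> tags that are direct children of a <td>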

So we're grabbing all the relevant row data, but because the table is poorly structured we have to do this with specific list indices. It's a laborious task, which is why I prefer alternative means of grabbing table data. I then add each row list to another list, because the csv module works in rows of data: we want to loop over each row and write it to the CSV file one at a time.

To open the CSV file for writing, we put it in write mode, but we also specify `newline=''`. This is because the csv module supplies its own row endings, and without `newline=''` they can get translated a second time, which shows up as a blank row after every row written to the CSV; we have to tell Python NOT to do this.
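As a minimal illustration (the blank rows are most visible on Windows, where the extra newline translation happens):

import csv

with open('demo.csv', 'w', newline='') as f:  # newline='' stops the extra blank rows
    writer = csv.writer(f)
    writer.writerow(['a', 'b'])
    writer.writerow(['c', 'd'])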

We use the csv.writer method to get a writer object for the CSV file we have chosen.

We then specify the columns we want and write them with the writer's writerow method, which writes a list out as one row of the CSV file.

We then take our rows list, that is, the list of nested row lists, loop over it, and write each row using the writerow method.
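As an aside, the writer also has a writerows method that takes the whole list of row lists in one call, so the final loop could be shortened to:

file.writerow(columns)
file.writerows(rows)  # writes every row list in a single call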

Potential Alternative

I mentioned I only use BeautifulSoup as a last resort. The pandas package can read HTML and grab table data. Unfortunately it wasn't working for this website, but as a pure example of how much more concise it can be, you would do something like the below.

import requests
import pandas as pd

url = 'https://www.transfermarkt.com/transfers/transferrekorde/statistik?saison_id=&land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&w_s=&plus=1'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
response = requests.get(url, headers=headers)
df_list = pd.read_html(response.text)  # pass the fetched HTML so the custom User-Agent is actually used

Using the pandas read_html method grabs table data easily. It's not always foolproof, but it's my go-to for grabbing table data from HTML.
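If read_html does cooperate, bear in mind it returns a list of DataFrames, one per table found on the page, so you pick yours out of the df_list above and can write the CSV in two more lines (a sketch, assuming the table you want is the first one found):

df = df_list[0]  # read_html returns a list of DataFrames; pick the table you want
df.to_csv('soccer_stats.csv', index=False)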

See the pandas read_html documentation (https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) for more details on this method.

Additional Resources

CSV module Python docs (https://docs.python.org/3/library/csv.html) - for looking up the individual methods to get a feel for them.

Tutorial on CSV with Python

Updated Code as per comments

To code this for all results, it's just a matter of putting the steps into separate functions. I don't think the code is big enough to warrant classes, so I haven't gone down that route. The key is to try to make each function do one thing.

import requests
from bs4 import BeautifulSoup
import csv

def get_page(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
    response = requests.get(url,headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def get_table(soup):
    rows = []
    player_table = soup.find_all('table', class_='items')
    player_table = player_table[0]
    for row in player_table.select('tbody > tr'):
        num = row.select('td')[0].get_text()
        name = row.select('td')[3].get_text().replace('\n','')
        position = row.select('td')[4].get_text()
        age = row.select('td')[5].get_text()
        market_value = row.select('td')[6].get_text()
        season = '20' + row.select('td > a')[1].get_text().strip()
        left_club = row.select('td > a')[3].get_text()
        left_league = row.select('td > a')[4].get_text()
        joined_club = row.select('td > a')[6].get_text()
        joined_league = row.select('td > a')[7].get_text()
        fee = row.select('td[class$="hauptlink"]')[-1].get_text()
        rows.append([num,name,position,age,market_value,season,left_club,left_league,joined_club,joined_league,fee])
    return rows

def write_csv(rows,i):
    with open('soccer_stats.csv','a', newline='') as r:
        file = csv.writer(r)
        if i == 1:
            columns = ['Number','Name','Position','Age','Market Value','Season Left','Club Left','League Left','Club Joined','League Joined','Transfer Fee']
            file.writerow(columns)
            for row in rows:
                file.writerow(row)
        else:
            for row in rows:
                file.writerow(row)

if __name__ == "__main__":
    
    for i in range(1,11):
        print('Page',i)
        url = f'https://www.transfermarkt.com/transfers/transferrekorde/statistik?ajax=yw2&altersklasse=&ausrichtung=&land_id=0&leihe=&page={i}&plus=1&saison_id=&spielerposition_id=&w_s='
        soup = get_page(url)
        rows = get_table(soup)
        write_csv(rows,i)

Updated Explanation

A couple of changes here. The get_page function just fetches the page, accepting the url variable; we will be looping and feeding URLs to this function later.

The get_table function does exactly as before, but for some reason it wasn't picking up the fees on the second page. This may be what the comment was about. Instead I used the CSS selector 'td[class$="hauptlink"]', which selects only td tags whose class attribute ends in hauptlink. There are several of these per row, but the last one is the cell that holds the fee, hence row.select('td[class$="hauptlink"]')[-1].get_text().
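In isolation the selector behaves like this (a toy snippet, not the real page markup):

from bs4 import BeautifulSoup

html = '<tr><td class="zentriert hauptlink">Player</td><td class="rechts hauptlink">€80.00m</td></tr>'
row = BeautifulSoup(html, 'html.parser')
print(row.select('td[class$="hauptlink"]')[-1].get_text())  # €80.00m - the last matching cell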

The write_csv function is very much the same, except that because we're going to be looping round this multiple times, writing the column headers every single time would not give us the correct format for the CSV file. So we test: if i, the number we're looping with, is 1, we add the columns first; otherwise we just add the rows. There's probably a neater solution, as I'm repeating code, but it works (see the sketch below).
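One way to remove the repetition is to write the header only on the first page and let the row-writing fall through unconditionally; a sketch of the same function:

def write_csv(rows, i):
    with open('soccer_stats.csv', 'a', newline='') as r:
        file = csv.writer(r)
        if i == 1:  # header once, on the first page only
            file.writerow(['Number','Name','Position','Age','Market Value','Season Left',
                           'Club Left','League Left','Club Joined','League Joined','Transfer Fee'])
        file.writerows(rows)  # rows are appended on every page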

Now we get to the bottom, where we actually do the looping.

Firstly, it's good practice to make sure functions are called within if __name__ == "__main__":.

We create a for loop, and we know that the only thing that changes from page to page is the page number. So using an f-string we can substitute that value in: on every iteration i goes up by one, giving us the next URL to scrape. We then call the functions needed to scrape each page.

Updated Resources

For more information on `if __name__ == "__main__":`, see here.

For more information on f-strings, see here.

AaronS
  • The code will throw an exception if the number of `td` tags is not 9; surround it with `try/except` – bigbounty Jul 23 '20 at 09:06
  • It's a brittle solution depending on the number of td tags within the table, but given the table structure is not changing, is it likely the row.select('td > a')[8] selector is not going to be there? I'll admit I haven't tried the code on all 10 pages. – AaronS Jul 23 '20 at 09:15
  • Thank you both! Do you know how I would be able to loop this function, so it encompasses all of the pages? – GET Jul 23 '20 at 21:24
  • I've reworked the script, when you start crawling multiple pages, it's best at the very least to put them into functions. See updated code. – AaronS Jul 23 '20 at 21:48
  • Could you tick my answer as the accepted one if you think it warrants it (at the left side of my answer) ? Thanks. – AaronS Jul 24 '20 at 06:12