
I'm fairly new to Beautiful Soup/Python/web scraping. I have been able to scrape data from a site, but I am only able to export the very first row to a CSV file (I want to export all of the scraped data into the file).

I am stumped on how to make this code export ALL scraped data into multiple individual rows:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")


for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):

        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()

        base_url = "https://www.infoplease.com"
        links = link['href']
        links = urljoin(base_url, links)

    pres_data = {'Name': [name],
                 'Date': [date],
                 'Link': [links]
                 }

    df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])

    df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)

    print(df)

Any ideas here? I believe I need to collect each set of values inside the parsing loop and push them into the output one at a time. Am I going about this the right way?

Thanks for any insight

quizno
  • Does this answer your question? [How to add pandas data to an existing csv file?](https://stackoverflow.com/questions/17530542/how-to-add-pandas-data-to-an-existing-csv-file) – Macattack Feb 23 '21 at 00:23
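
For reference, the approach from that linked question boils down to passing mode='a' to to_csv so each write appends instead of overwriting. A minimal sketch, assuming the file path and columns from the question (the row values here are hypothetical placeholders):

import pandas as pd

# Hypothetical single row matching the question's columns
row = pd.DataFrame({'Name': ['example'], 'Date': ['example'], 'Link': ['example']})

row.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv',
           mode='a',       # append to the file instead of overwriting it
           index=False,
           header=False)   # the header was already written on the first write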

2 Answers


The way it is currently set up, you are not adding each link as a new entry; the dictionary and DataFrame are rebuilt on every pass through the outer loop, so only the last link survives. If you initialize a list before the loops and append a dictionary to it on each iteration of the inner "links" for loop, you will keep every row and not just the last one.

import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")

pres_data = []  # one dict per link, collected across all iterations
for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):

        # The link text looks like "Name (Date)"; split it into the two parts
        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()

        # Resolve the relative href against the site root
        base_url = "https://www.infoplease.com"
        links = link['href']
        links = urljoin(base_url, links)

        this_data = {'Name': name,
                     'Date': date,
                     'Link': links
                     }
        pres_data.append(this_data)

# Build the DataFrame once, after all rows have been collected
df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])

df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)

print(df)
j-berg

You don't need to use pandas here, since you aren't applying any data operations to what you scrape.

As a rule, try to limit yourself to the built-in libraries when the task is this short; the csv module covers the output.

import requests
from bs4 import BeautifulSoup
import csv


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # Each row is [href, name, date]: strip the trailing ')' from the link
    # text and split "Name (Date" into its two parts. Note that the hrefs
    # are written as-is, i.e. relative to the site root.
    target = [([x.a['href']] + x.a.text[:-1].split(' ('))
              for x in soup.select('span.article')]
    with open('data.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Url', 'Name', 'Date'])
        writer.writerows(target)


main('https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses')
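
One caveat: unlike the first answer, this writes the relative href values into the CSV. If you want absolute URLs instead, one option (a sketch, not part of the original answer) is to resolve each href against the page URL with urljoin:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # Same comprehension as above, but with each href resolved to an
    # absolute URL using the page URL passed into main()
    target = [([urljoin(url, x.a['href'])] + x.a.text[:-1].split(' ('))
              for x in soup.select('span.article')]
    with open('data.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Url', 'Name', 'Date'])
        writer.writerows(target)


main('https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses')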

Sample of output: a data.csv file with Url, Name, and Date columns (screenshot omitted).