
I'm a beginner with Python and am trying to learn through a BeautifulSoup web-scraping project.

I'm looking to scrape each record's title, item URL, and purchase date from this URL and export them to a CSV.

I've made good progress scraping the title and URL, but I can't figure out how to extract the purchase date correctly in my for loop (the purchase_date variable below).

Currently, the purchase date column in the CSV file (the p_date column) contains only blank cells. There is no error message; the data just isn't written to the CSV. Any guidance is much appreciated.

Thank you!!


import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {"Accept-Language": "en-US, en;q=0.5"}
url = "https://www.popsike.com/php/quicksearch.php?searchtext=metal+-signed+-promo+-beatles+-zeppelin+-acetate+-test+-sinatra&sortord=aprice&pagenum=1&incldescr=1&sprice=100&eprice=&endfrom=2020&endthru=2020&bidsfrom=&bidsthru=&layout=&flabel=&fcatno="
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")


title = []
date = []
URL = []

record_div = soup.find_all('div', class_='col-md-7 add-desc-box')


for container in record_div:

    description = container.a.text
    title.append(description)

    link = container.find('a')
    URL.append(link.get('href'))

    purchase_date = container.find('span',class_= 'info-row').text
    date.append(purchase_date)


test_data = pd.DataFrame({
'record_description': title,
'link': URL,
'p_date': date
})

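# regex=False treats '../' as literal text; as a regex, each '.' would match any character
test_data['link'] = test_data['link'].str.replace('../', 'https://www.popsike.com/', n=1, regex=False)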


print(test_data)

test_data.to_csv('popaaron.csv')

NerrWK
  • Does your test_data have the correct values? Try printing test_data before writing it to the CSV. – Anjaly Vijayan Sep 21 '20 at 01:08
  • Is there a particular reason you are using data frames before you write to csv? There are simpler alternatives that can get the job done. Let me know if you are open to solutions. – SaaSy Monster Sep 22 '20 at 14:53
  • The honest answer is I'm a beginner and was following a tutorial to get this far, so I am absolutely open to suggestions and alternative solutions. Thank you, Ares! – NerrWK Sep 27 '20 at 19:13
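
For reference, one of the simpler alternatives hinted at above could be the standard-library csv module. This is only a minimal sketch: the rows below are made-up placeholder values rather than scraped data, and write_records is a hypothetical helper name. The column names match the ones used in the question.

```python
import csv
import io

# Placeholder rows (invented for illustration, not real scraped data):
# (record_description, link, p_date)
rows = [
    ("Example record A", "https://example.com/item/1", "2020-05-17"),
    ("Example record B", "https://example.com/item/2", "2020-06-02"),
]


def write_records(rows, fileobj):
    """Write a header line followed by one CSV line per record."""
    writer = csv.writer(fileobj)
    writer.writerow(["record_description", "link", "p_date"])
    writer.writerows(rows)


# An in-memory buffer keeps the sketch self-contained; a real file works the same way.
buffer = io.StringIO()
write_records(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the same helper can be called with a file opened as open('popaaron.csv', 'w', newline=''), which is how the csv module documentation recommends opening output files.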

1 Answer


I suggest changing the parser type:

soup = BeautifulSoup(results.text, "html5")

And fix the search expression for the purchase date:

purchase_date = container.select('span.date > b')[0].text.strip(' \t\n\r')
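
To illustrate the selector without hitting the site, here is a small self-contained sketch. The HTML snippet is hypothetical: it is modelled only on the class names and selector used in this thread, so the real page markup may differ. It also uses select_one rather than select(...)[0], an optional variation that returns None when a listing has no date instead of raising an IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled only on the selectors used in this thread;
# the real page may be structured differently.
SAMPLE_HTML = """
<div class="col-md-7 add-desc-box">
  <a href="../example/1">Example record title</a>
  <span class="date">Date: <b>
    2020-05-17
  </b></span>
</div>
<div class="col-md-7 add-desc-box">
  <a href="../example/2">Listing without a date</a>
</div>
"""

# "html.parser" is enough for this well-formed sample.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

dates = []
for container in soup.find_all("div", class_="col-md-7 add-desc-box"):
    # select_one returns the first match, or None if nothing matches
    date_tag = container.select_one("span.date > b")
    # get_text(strip=True) removes the surrounding whitespace and newlines
    dates.append(date_tag.get_text(strip=True) if date_tag else "")

print(dates)
```

Appending an empty string for missing dates keeps the date list the same length as the title and URL lists, which pd.DataFrame requires when it is built from a dict of lists.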
Alexandra Dudkina
  • wow thank you so much!! I needed to use this as parser type: soup = BeautifulSoup(results.text, "html5lib") and then this worked great, Alexandra. THANK YOU. – NerrWK Sep 27 '20 at 19:11
  • Alexandra if you don't mind me asking, what does the ">" mean in 'span.date > b' ...... does that search everything within the b container tag? – NerrWK Sep 29 '20 at 00:44
  • ">" selects an element that sits directly inside a specific parent. It is known as the child combinator, i.e. it matches only direct children of the parent, not deeper descendants. Here is a detailed explanation with an example: https://stackoverflow.com/a/3225905/2792888 – Alexandra Dudkina Sep 29 '20 at 06:37
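
The difference between the child combinator and a plain descendant selector can be seen on a tiny example. The markup below is invented purely for illustration and is not taken from the site.

```python
from bs4 import BeautifulSoup

# Invented markup: one <b> is a direct child of the span,
# the other is nested one level deeper inside an <i>.
html = """
<span class="date">
  <b>direct child</b>
  <i><b>nested grandchild</b></i>
</span>
"""
soup = BeautifulSoup(html, "html.parser")

# "span.date > b" matches only <b> tags whose parent is the span itself
child_only = [b.get_text() for b in soup.select("span.date > b")]

# "span.date b" (a space instead of ">") matches <b> tags at any depth
any_depth = [b.get_text() for b in soup.select("span.date b")]

print(child_only)
print(any_depth)
```

So the ">" does not search inside the b tag; it restricts which b tags are matched, namely those directly under span.date.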