I am a beginner at web scraping with BeautifulSoup. I can't manage to scrape several pages (5) of the following website: http://www.newyorksocialdiary.com/party-pictures (http://www.newyorksocialdiary.com/party-pictures?page=1-5), and I don't know how to put the output (the dates) in a DataFrame. Thanks!

  from bs4 import BeautifulSoup
  import requests
  for i in range(10):
      url = "http://www.newyorksocialdiary.com/party-pictures".format(i)
      r = requests.get(url)
      soup = BeautifulSoup(r.text)

  for r in soup.findAll('span', attrs={'class': 'views-field views-field-created'}):
      print r.get_text()

Yasmine Nouri

2 Answers

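from bs4 import BeautifulSoup
import requests

# Loop over the page numbers; adjust the range to the pages you need.
for i in range(10):
    # The {} placeholder is what .format(i) fills in with the page number.
    url = "http://www.newyorksocialdiary.com/party-pictures?page={}".format(i)
    r = requests.get(url)
    # Name the parser explicitly so BeautifulSoup does not have to guess one.
    soup = BeautifulSoup(r.text, "html.parser")

    # Each date sits in a span with these two classes.
    for span in soup.find_all('span', attrs={'class': 'views-field views-field-created'}):
        print(span.get_text())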

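You almost had it. Just change your URL: it needs the ?page={} placeholder so that .format(i) has something to fill in, and the loop that prints the dates has to sit inside the loop over the pages.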

宏杰李

The general approach to scraping a website is to first figure out how its pagination is implemented.

Generally, it is one of the following:

  1. Through a page parameter such as ?page=1, ?page=2, ?page=3. This is your case, and it is one of the easier ones: you just keep a counter and loop through all the pages you need.

  2. Through a different absolute URL for each page, which is the easiest one.

  3. Through HTML POST requests, which may be a bit more tricky.

In your case it is just a page variable, so you can attach it to the base URL and get what you want.
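As a rough illustration of the page-parameter case, the URLs can be built separately from the fetching, which makes them easy to check before sending any request. This is only a sketch: page_urls is a hypothetical helper name, and whether the site's first page is numbered 0 or 1 is an assumption you would need to verify in the browser.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.newyorksocialdiary.com/party-pictures"


def page_urls(base_url, n_pages, start=0):
    """Yield one URL per page by attaching a page query parameter.

    page_urls is a hypothetical helper; `start` is the number of the
    first page, which is an assumption to check against the real site.
    """
    for page in range(start, start + n_pages):
        yield "{}?{}".format(base_url, urlencode({"page": page}))


# Build the five URLs from the question (page=1 through page=5).
urls = list(page_urls(BASE_URL, 5, start=1))
for url in urls:
    print(url)
```

Each of these URLs can then be passed to requests.get in turn.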

For the pandas part, there's a handy read_html function (pandas.read_html), which reads the HTML tables on a page into DataFrames.
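Note that read_html only picks up table elements, while the selector in the question targets span elements, so the dates may have to be collected by hand instead. Here is a minimal sketch of that, under a few assumptions: extract_dates is a hypothetical helper, SAMPLE_HTML is made-up markup standing in for r.text, and the dates are kept as plain text because the exact date format on the site is not known here.

```python
from bs4 import BeautifulSoup
import pandas as pd


def extract_dates(html):
    """Return the text of every span that has the views-field-created class.

    extract_dates is a hypothetical helper, not part of any library.
    """
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.find_all("span", class_="views-field-created")
    return [span.get_text(strip=True) for span in spans]


# Made-up sample markup; in real use this would be r.text for each page.
SAMPLE_HTML = """
<div><span class="views-field views-field-created">Friday, March 3, 2017</span></div>
<div><span class="views-field views-field-created">Monday, February 27, 2017</span></div>
"""

# Collect one row per date, remembering which page it came from.
rows = []
for page, html in enumerate([SAMPLE_HTML]):
    for date_text in extract_dates(html):
        rows.append({"page": page, "date": date_text})

# Dates stay as strings; pd.to_datetime could convert them once the
# site's real date format has been confirmed.
df = pd.DataFrame(rows, columns=["page", "date"])
print(df)
```

To use it on the real site, replace the list containing SAMPLE_HTML with the r.text of each page fetched in the loop.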

Bobby