how to scrape subsequent pages and put output in a dataframe

Question

I am a beginner at web scraping with BeautifulSoup. I can't manage to scrape several pages (5) of the following website: http://www.newyorksocialdiary.com/party-pictures (http://www.newyorksocialdiary.com/party-pictures?page=1-5), and I don't know how to put the output (the dates) in a dataframe. Thanks!

  from bs4 import BeautifulSoup
  import requests
  for i in range(10):
      url= "http://www.newyorksocialdiary.com/party-pictures".format(i)
      r=requests.get(url)
      soup= BeautifulSoup(r.text)

  for r in soup.findAll('span', attrs={'class': 'views-field views-field-created'}) :
      print r.get_text()

score 2 · Accepted Answer · answered Jan 30 '17 at 02:11


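from bs4 import BeautifulSoup
import requests

for i in range(10):
    # Put the page number in the query string; the original URL had no {} placeholder
    url = "http://www.newyorksocialdiary.com/party-pictures?page={}".format(i)
    r = requests.get(url)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(r.text, "html.parser")

    # Parse each page inside the loop, otherwise only the last page is read
    for span in soup.find_all('span', attrs={'class': 'views-field views-field-created'}):
        print(span.get_text())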

You almost had it. Just change your URL so that it includes the page parameter, and parse each page inside the loop.


宏杰李


Thanks!! Very much... – Yasmine Nouri Jan 30 '17 at 03:11

score 1 · Answer 2 · answered Jan 30 '17 at 01:02

The general pattern when scraping a website is to first figure out how its paging is implemented.

Generally, paging is done in one of three ways:

1. Through a page parameter such as ?page=1, ?page=2, ?page=3. This is your case, and it is probably one of the easier ones: you just keep a counter and loop through all the pages you need.
2. Through a different absolute URL for each page, which is the easiest one.
3. Through HTML POST requests, which may be a bit more tricky.

In your case it is just a page variable, so you can append it to the base URL and get what you want.
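As a minimal sketch of the counter approach: the helper below builds the list of page URLs before any request is made. The name `page_urls` is made up for illustration, and starting the count at 0 is an assumption taken from the `range(10)` loop in the accepted answer, so check which number the site uses for its first page.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.newyorksocialdiary.com/party-pictures"


def page_urls(base_url, n_pages, start=0):
    """Return the URLs of n_pages consecutive pages, beginning at page `start`."""
    # urlencode builds the "page=<n>" query string for each counter value
    return [
        "{}?{}".format(base_url, urlencode({"page": page}))
        for page in range(start, start + n_pages)
    ]


# Print the five URLs that would be requested (no network access happens here)
for url in page_urls(BASE_URL, 5):
    print(url)
```

Each URL can then be passed to `requests.get` in a loop, as in the accepted answer.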

For the pandas part, there's a handy read_html function (pandas.read_html), which parses HTML tables into DataFrames.
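Note that read_html only reads `<table>` elements. If the dates sit in `<span>` tags, as the selector in the question suggests, one alternative is to collect the text with BeautifulSoup and build the DataFrame by hand. The following is only a sketch under that assumption: the function names, the column names and the sample HTML are invented for illustration and are not taken from the site.

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_dates(html):
    """Return the text of every date span found in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.find_all(
        "span", attrs={"class": "views-field views-field-created"}
    )
    return [span.get_text(strip=True) for span in spans]


def dates_to_frame(pages_html):
    """Build a DataFrame with one row per date, remembering the source page."""
    rows = []
    for page_number, html in enumerate(pages_html):
        for text in extract_dates(html):
            rows.append({"page": page_number, "date_text": text})
    return pd.DataFrame(rows, columns=["page", "date_text"])


# Made-up sample markup standing in for two downloaded pages.
# With real pages this list would hold requests.get(url).text for each page URL.
sample_pages = [
    '<span class="views-field views-field-created">Friday, January 27, 2017</span>',
    '<span class="views-field views-field-created">Monday, January 23, 2017</span>',
]

df = dates_to_frame(sample_pages)
print(df)
```

If real dates are needed rather than strings, `pd.to_datetime(df["date_text"], errors="coerce")` is one way to try converting the column; whether it parses cleanly depends on the date format the site actually uses.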
