
A friend and I are trying to calculate some comment metrics (how many users comment on certain posts, who comments, how many comments does each user add, etc.) at a baseball blog that we frequent.

I know little to nothing about web programming or scraping, but I know a bit of Python so I volunteered to help (she was copying and pasting comments into .txt files and using Cmd + F to tally up comments).

My initial approach uses urllib2 and BeautifulSoup (Python 2.7):

import sys,re,csv,glob,os
from collections import Counter
import urllib2
from bs4 import BeautifulSoup

url = "http://www.royalsreview.com/2016/6/8/11881484/an-analysis-of-rr-game-threads#comments"
f = urllib2.urlopen(url).read()
soup = BeautifulSoup(f, "html.parser")

userlist = soup.find_all("div", class_="comment")

I sort of know what I'm looking for by going to the URL in Chrome and clicking "Inspect" on a comment, which shows me the bit of HTML I need to tally up comments.

However, when I use urllib2 to read the URL, the HTML that it pulls does not include the comments on that webpage.

From my research, I think it's because urllib2 gets the page's source from the server, but that source doesn't include the content generated by JavaScript (I'm venturing from my comfortable place, here), e.g. the comments.

How can I get the page AFTER users have changed it by adding comments?

Thanks for the help

Nic
    This might help you. http://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python – ravindar Jun 09 '16 at 15:36
  • @ravindar Yes, I found that page about 30 seconds after posting my question (I swear I did research before posting). It's amazing how typing out a question can give one a better idea of what one is looking for. Anyway, I'm installing dryscrape now and all my problems should be solved. – Nic Jun 09 '16 at 15:38
  • What do you mean by "How can I get the page AFTER users have changed it by adding comments?" Could you please elaborate? – Abhirath Mahipal Jun 09 '16 at 16:15

3 Answers


You can get the data in JSON format by making a GET request to http://www.royalsreview.com/comments/load_comments/11645525:

import requests
from collections import Counter
from operator import itemgetter
url = "http://www.royalsreview.com/comments/load_comments/11645525"
js = requests.get(url).json()
cn = Counter(map(itemgetter("username"), js["comments"]))

print(cn)

Which gives you:

Counter({u'artzfreak': 20, u'RoyallyDisplaced': 9, u'sterlingice': 8, u'Max Rieper': 6, u'Scott McKinney': 6, u'Cody.McElroy': 3, u'GrassyKnoll': 3, u'Farmhand': 3, u'Minda Haas Kuhlmann': 3, u'Warden11': 2, u'nom nom nom de plume': 2, u'1040X': 2, u'Nighthawk at the Diner': 2, u'thelaundry': 2, u'Gopherballs': 2, u"Daenerys C. O'sFanaryen": 1, u'Shaun Newkirk': 1, u'Blue and Red': 1, u'wcgrad': 1, u'MrAndersonmm': 1, u'DCChiefFan': 1, u'J.K. Ward': 1, u'philofthenorth': 1, u'Mink Farmer': 1, u'keith jersey': 1, u'Kevin Ruprecht': 1, u'Tim Webber': 1, u'Matthew LaMar': 1, u'MightyMinx': 1, u'Quisenberry4Ever': 1, u'Daloath': 1, u'HalsHatsCrooked': 1, u'pete_clarf': 1})

If you print js["comments"] you will see a list of dicts like:

{u'ancestry': u'0379481445',
  u'bad_flags_count': 0,
  u'body': u'<blockquote>So maybe it\u2019s as hunter s. royal suggested, and it\u2019s because philofthenorth got a job.</blockquote>',
  u'created_on': u'2016-06-08T17:35:09.000Z',
  u'created_on_long': u'Jun  8, 2016 |  1:35 PM',
  u'created_on_short': u'06.08.16  1:35pm',
  u'created_on_timestamp': 1465407309,
  u'depth': 1,
  u'entry_id': 11645525,
  u'hidden': False,
  u'id': 379481445,
  u'inappropriate_flags_count': 0,
  u'parent_id': None,
  u'permalink': u'/2016/6/8/11881484/an-analysis-of-rr-game-threads/comment/379481445',
  u'recommended_flags_count': 5,
  u'shortlink': u'/c/379481445',
  u'signature': u'',
  u'spam_flags_count': 0,
  u'title': u'Love it',
  u'troll_flags_count': 0,
  u'user_id': 153964,
  u'username': u'sterlingice',
  u'version': 1}

Each comment has its own dict holding all the info above.
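
Since each dict also carries fields like parent_id and depth (shown above), the same JSON supports more than per-user tallies. As a minimal sketch reusing js from the request above, you can split top-level comments from replies:

top_level = [c for c in js["comments"] if c["parent_id"] is None]
replies = [c for c in js["comments"] if c["parent_id"] is not None]
print(len(top_level), len(replies))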

To avoid having to hardcode the entry_id, we can parse it from the actual page and then pass it in:

import requests
from collections import Counter
from operator import itemgetter
from bs4 import BeautifulSoup

init_url = "http://www.royalsreview.com/2016/6/8/11881484/an-analysis-of-rr-game-threads#comments"

url = "http://www.royalsreview.com/comments/load_comments/{}"
entry_id = BeautifulSoup(requests.get(init_url).content, "html.parser").select_one("h2.m-entry__title")["data-remote-admin-entry-id"]
print(entry_id)
js = requests.get(url.format(entry_id)).json()
cn = Counter(map(itemgetter("username"), js["comments"]))

print(cn)

So there's no need for any JavaScript, and you get all the data as nicely formatted JSON.

Padraic Cunningham
  • Hmmm, interesting. I know nothing about JSON. I want to do this analysis with about 130 (and counting) webpages, and I'd like to figure out a way to loop it instead of having to manually retrieve each URL and put it into the script. Your entry_id formatting may do the trick. I can think of a way to request all pages from 2016, or something like that, but not sure if there's a way to do it if I want only specific types of posts. – Nic Jun 09 '16 at 23:24
  • @Nic, think of json as basically being a dict. Where are the urls coming from? – Padraic Cunningham Jun 09 '16 at 23:25
  • The full archives: http://www.royalsreview.com/archives/full. I'm only interested in those threads that have "Rumblings" in the title. – Nic Jun 09 '16 at 23:26
  • Very simple to automate it all. It may actually be better to ask a new question with the specific info, and make sure to add how you are currently parsing one page. You can add the link to it here and I will do up an answer. – Padraic Cunningham Jun 09 '16 at 23:27
  • Great! Got to run now. I'll set up that new question later and will make sure you see it. – Nic Jun 09 '16 at 23:31
  • No worries, I have the code finished already, you can pass any year and month and pull all the links with rumblings in it and get a Counter of comments for each page – Padraic Cunningham Jun 09 '16 at 23:42
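
A sketch of the loop described in these comments: the entry-id parsing is taken from the answer above, but the archive URL pattern, the "Rumblings" title match against the link text, and the assumption of absolute hrefs are all unverified guesses based on the discussion, not confirmed against the live site.

import requests
from bs4 import BeautifulSoup
from collections import Counter
from operator import itemgetter

ARCHIVE = "http://www.royalsreview.com/archives/{year}/{month}"  # assumed URL pattern
COMMENTS = "http://www.royalsreview.com/comments/load_comments/{}"

def rumblings_counts(year, month):
    # Pull the archive page and keep only links with "Rumblings" in the text
    html = requests.get(ARCHIVE.format(year=year, month=month)).content
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for a in soup.find_all("a", href=True):
        if "Rumblings" not in a.get_text():
            continue
        # Assumes hrefs on the archive page are absolute URLs
        post = BeautifulSoup(requests.get(a["href"]).content, "html.parser")
        title = post.select_one("h2.m-entry__title")
        if title is None:
            continue  # page layout differs; skip it
        js = requests.get(COMMENTS.format(title["data-remote-admin-entry-id"])).json()
        results[a["href"]] = Counter(map(itemgetter("username"), js["comments"]))
    return results

print(rumblings_counts(2016, 6))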

Use Selenium or dryscrape. Selenium opens up a web browser and you can actually watch it do stuff. Since a real browser is used, it renders JavaScript as well.

I noticed a span with the number of comments displayed in the page source:

<span class="comments-count">80</span>

You can scrape this span to get the number of comments, and scrape it again at a later point in time to see whether there are new comments (people are unlikely to delete comments).

Here is the code to scrape the comments using Selenium:

from selenium import webdriver

driver = webdriver.Firefox()  # opens Firefox
driver.get('http://www.royalsreview.com/2016/6/8/11881484/an-analysis-of-rr-game-threads#comments')

# Find all elements with the class "comment"; userlist is a list of those elements
userlist = driver.find_elements_by_class_name('comment')

print userlist[0].text

The print statement above outputs the first comment on the page:

Love it So maybe it’s as hunter s. royal suggested, and it’s because philofthenorth got a job. by sterlingice on Jun 8, 2016 | 1:35 PM reply

Use userlist[1].text to print the second comment and so on.

And I'd suggest you use

print len(userlist)

to cross-reference with the number of comments on the page. The page says that there are 80 comments, but the length of userlist is 82, so be sure to cross-check that.
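
The count in that span can also be read with the same driver, so the cross-check can live in the script itself. A small sketch, reusing driver and userlist from the code above (the class name comes from the markup quoted earlier):

count_span = driver.find_element_by_class_name('comments-count')
print count_span.text, len(userlist)  # e.g. 80 vs. 82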

If you have any questions let me know :)

Abhirath Mahipal

What I ended up doing:

import sys,re,csv,glob,os
from collections import Counter
import dryscrape
from bs4 import BeautifulSoup

url = "http://www.royalsreview.com/2016/6/8/11881484/an-analysis-of-rr-game-threads#comments"

session = dryscrape.Session()
session.visit(url)
f = session.body()
soup = BeautifulSoup(f, "html.parser")

posterSoup = soup.find_all("a", class_="poster")
posterList = []
for poster in posterSoup:
    posterList.append(poster.string.lstrip())

posterCount = Counter(posterList)
with open('metrics.txt', 'w') as out:
    for entry in posterCount:
        out.write(entry + "\t" + str(posterCount[entry]) + "\n")

What I really wanted was the number of comments submitted by each poster. So I scraped the poster name from each comment, put the names into a list, and used Counter to collapse that list into a counter container. I could then write out each poster's username followed by the number of comments contributed by that user.
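
If the output is more useful sorted by comment count, Counter's most_common() returns the (poster, count) pairs in descending order; an optional tweak to the write loop above:

with open('metrics.txt', 'w') as out:
    for poster, count in posterCount.most_common():
        out.write(poster + "\t" + str(count) + "\n")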

Nic