
I'm scraping a web site using Python 3.5 (BeautifulSoup). I can read everything in the source code, but I haven't been able to retrieve the comments embedded via Disqus (which are loaded by a reference to a script).

The relevant piece of the HTML source looks like this:

var disqus_identifier = "node/XXXXX";
<script type='text/javascript' src='https://disqus.com/forums/siteweb/embed.js'></script>

The src points to a script that loads the comments.

I've read the suggestions on Stack Overflow about using Selenium, but I had a really hard time making it work, with no success. I understand that Selenium emulates a browser (which I believe is too heavy for what I want), and I also had problems getting the webdrivers to work correctly. So I dropped this option.

I would like to be able to execute the script and retrieve the .js output with the comments. I found that a possible solution is PyV8, but I can't import it in Python. I've read the posts online and googled it, but it's not working.

I installed Sublime Text 3 and downloaded pyv8-win64-p3 manually into:

C:\Users\myusername\AppData\Roaming\Sublime Text 3\Installed Packages\PyV8\pyv8-win64-p3

But I keep getting:

ImportError: No module named 'PyV8'.

If somebody can help me, I'll be very thankful.

kirogasa
pacode
    I found a way to do it. It suffices to construct the URL that the embedded frame points to (one of the things the JavaScript does) and parse it. – pacode Sep 25 '16 at 09:56

2 Answers


You can construct the Disqus API request by studying the page's network traffic; all the required data is present in the page source, and Disqus sends it as a query string. I recently extracted comments from the Disqus API; here is some sample code.

Example: here `soup` is the parsed page source and `params_dict = json.loads(str(soup).split("embedVars = ")[1].split(";")[0])`. The `getLink` and `getJson` helpers used below simply fetch a URL and return the page text or the parsed JSON; minimal `requests`-based versions are sketched so the snippet runs.

import json
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0'
}

def getLink(url):
    # Assumed helper: fetch a URL and return the response body as text.
    return requests.get(url, headers=HEADERS).text

def getJson(url):
    # Assumed helper: fetch a URL and parse the response body as JSON.
    return requests.get(url, headers=HEADERS).json()

def disqus(params_dict, soup):
    comments_list = []
    base = 'default'
    s_o = 'default'
    version = '25916d2dd3d996eaedf6cdb92f03e7dd'
    f = params_dict['disqusShortname']
    t_i = params_dict['disqusIdentifier']
    t_u = params_dict['disqusUrl']
    t_e = params_dict['disqusTitle']
    t_d = soup.head.title.text
    t_t = params_dict['disqusTitle']
    # Build the same embed URL that the Disqus loader script requests.
    url = ('http://disqus.com/embed/comments/?base=%s&version=%s&f=%s&t_i=%s'
           '&t_u=%s&t_e=%s&t_d=%s&t_t=%s&s_o=%s&l='
           % (base, version, f, t_i, t_u, t_e, t_d, t_t, s_o))
    comment_page = getLink(url)
    # The embed page carries the thread metadata in an inline JSON <script> block.
    temp_dict = json.loads(comment_page.split('threadData" type="text/json">')[1].split('</script')[0])
    thread_id = temp_dict['response']['thread']['id']
    forumname = temp_dict['response']['thread']['forum']
    i = 1
    flag = True
    while flag:
        disqus_url = ('http://disqus.com/api/3.0/threads/listPostsThreaded?limit=100'
                      '&thread=' + thread_id + '&forum=' + forumname +
                      '&order=popular&cursor=' + str(i) + ':0:0'
                      '&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F')
        data = getJson(disqus_url)
        comments_list.extend(data['response'])
        # Keep paging while the API reports another cursor page.
        flag = data.get('cursor', {}).get('hasNext', False)
        i += 1
    return comments_list
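
For context, here is one way to obtain `soup` and `params_dict` before calling the function. This is a minimal sketch: the article URL is a placeholder, and it assumes the target page embeds its Disqus settings in an `embedVars = {...};` blob as described above.

    import json
    import requests
    from bs4 import BeautifulSoup

    # Hypothetical URL of a page that embeds Disqus comments.
    page = requests.get('http://example.com/article-with-disqus')
    soup = BeautifulSoup(page.text, 'html.parser')

    # Pull the embedVars JSON blob out of the page source.
    params_dict = json.loads(str(soup).split('embedVars = ')[1].split(';')[0])
    comments = disqus(params_dict, soup)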

It will return JSON in which you can find and extract the comments. Hope this helps.
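For example, to pull the text out of the returned posts (field names such as `raw_message` and `author` are assumed from the Disqus API's post objects; verify them against an actual response):

    # 'comments' as returned by disqus() above; each entry is a post object.
    for post in comments:
        print(post['author']['name'], ':', post['raw_message'])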

RamenChef
Prashant
  • Thanks @RamenChef, I used that. I'm trying to do the same for Facebook comments, but it's much less clear how to get the API keys. Do you have an idea on this? I would like to avoid using Facebook's API to get a token (which requires having an FB account). – pacode Sep 27 '16 at 07:02
  • So, @pacode, you can get the API keys by monitoring the network traffic. – Prashant Sep 27 '16 at 08:08

For Facebook embedded comments, you may use Facebook's Graph API to extract the comments in JSON format.

Example:

Facebook comments: `https://graph.facebook.com/comments/?ids=<link of the page>`
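
A minimal sketch of querying that endpoint with `requests` (the page URL is a placeholder; note the comments below, which point out that current API versions also require an access token):

    import requests

    # Hypothetical URL of the page whose embedded comments you want.
    PAGE_URL = 'http://example.com/article'
    resp = requests.get('https://graph.facebook.com/comments/',
                        params={'ids': PAGE_URL})
    print(resp.json())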

Prashant
  • Thanks for the answer @Prashant. Apparently this solution is no longer available; a token is now needed. I receive the message: "An access token is required to request this resource." I think the API changed somewhere between 2011 and 2015. – pacode Sep 28 '16 at 05:55
  • I am still working on this problem, and I am now able to do exactly what you suggested. I can recover the URL's id (by looking at the HTML source of the frame where the Facebook plugin puts the comments) and insert it in the Graph API like this: `comments/?ids=[found_id]`. Is it possible to retrieve this id, or the URL the plugin is pointing to? Thanks for the insight! – pacode Oct 02 '16 at 11:14
  • So I found the solution to that. This is the new link where you can extract FB data using an FB API token: `https://graph.facebook.com/849499125131056/comments/?access_token=[FB_Access Token]`. (For the id, you can study the network traffic of the page.) Try clicking on the sort drop-down box of the Facebook plugin on the page; you can then get the page id by looking at the network. Hope this works fine. @pacode – Prashant Oct 03 '16 at 10:18
  • Yes, thanks for that. Actually I've followed your previous suggestion and I was able to recover the id from the network. My problem now is: given a URL with an embedded frame (Facebook plugin) that shows Facebook comments, how can I retrieve this id? In your example I get from the API: `{ "created_time": "2015-05-11T21:47:48+0000", "title": "Olivia Palermo x Ciaté Collection", "type": "article", "id": "849499125131056"}` and, of course, the comments. So how do I know which URL this id comes from? Would something like graph.com/id=url/comments work? – pacode Oct 03 '16 at 18:53
  • @pacode So, every URL that has a Facebook plugin has a node id, which is what I am talking about. Go to the page where you want to extract FB comments and try clicking on the sort drop-down box of the Facebook plugin; you can then get the unique node id of the page by looking at the network. A sample link from the network when doing the above step: `https://www.facebook.com/plugins/comments/async/849499125131056/pager/social/?dpr=1`. Extract the id from there and put it in the link I shared previously (see the sketch after these comments). Thanks – Prashant Oct 04 '16 at 09:27
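
Putting the last few comments together, a sketch of the token-based call. The node id is the one from the comments above; the access token is a placeholder you must obtain yourself, and the `from`/`message` field names are assumed from Graph API comment objects:

    import requests

    NODE_ID = '849499125131056'         # node id found via the network tab, as described above
    ACCESS_TOKEN = '[FB_Access Token]'  # placeholder; requires a Facebook account/app
    url = 'https://graph.facebook.com/%s/comments/' % NODE_ID
    resp = requests.get(url, params={'access_token': ACCESS_TOKEN})
    for comment in resp.json().get('data', []):
        # Field names assumed from Graph API comment objects.
        print(comment.get('from', {}).get('name'), ':', comment.get('message'))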