
I'm designing a link-scraping program that grabs the basic link-preview fields for a given URL: page title, description, images, and so on. So far I've got a pretty good working version that uses the Python requests library and Beautiful Soup.
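
For reference, here's a rough sketch of the kind of extraction I'm doing (illustrative only, not my exact code; the helper name and the Open Graph fallbacks are just one way to do it):

import requests
from bs4 import BeautifulSoup

def get_preview(url):
    r = requests.get(url, allow_redirects=True)
    soup = BeautifulSoup(r.text, "html.parser")

    # <title> tag, if present
    title = soup.title.string if soup.title else None

    # Open Graph tags, falling back to the plain meta description
    def og(prop):
        tag = soup.find("meta", attrs={"property": prop})
        return tag.get("content") if tag else None

    description = og("og:description") or \
        (soup.find("meta", attrs={"name": "description"}) or {}).get("content")
    image = og("og:image")

    return {"title": title, "description": description, "image": image}

print(get_preview("http://www.facebook.com/cocacola/app_106795496113635"))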

Most URLs come across perfectly, but when I try the URL of a Facebook app, I get a different HTML response than if I access it from a browser directly. For instance, if I navigate to the app in a browser and view source, I'll see a title specific to that app. However, the HTML response in Python returns the generic Facebook.com title.

I'm trying to understand how the Facebook app page delivers one HTML response to my browser and a different one to my Python server.

Facebook app example: http://www.facebook.com/cocacola/app_106795496113635

From browser response:

<title>Coca-Cola</title>

From Python 'requests' response:

<title>Facebook</title>

Python code:

import requests

url = "http://www.facebook.com/cocacola/app_106795496113635"
r = requests.get(url, allow_redirects=True)
html = r.text
print(html)

UPDATE: OK, so I just realized the Python response is for a Facebook login page. This is a public app, though, so the question is why it requires a login when requested from my server.

Yarin

4 Answers


Like some other folks have mentioned, Facebook is looking at your User-Agent string. You can set it in the headers you send with your request:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3)..."}
r = requests.get("http://www.facebook.com/cocacola/app_106795496113635",
                 headers=headers, allow_redirects=True)
print(r.text)

Otherwise you will get a redirect to the login page, as you have noticed.
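
If you want to see that redirect happening, requests keeps the redirect chain on the response; here's a quick check (just a sketch, and the "login" substring test is only a heuristic):

import requests

r = requests.get("http://www.facebook.com/cocacola/app_106795496113635",
                 allow_redirects=True)

# r.history holds the intermediate responses, r.url is where you ended up
print([resp.status_code for resp in r.history])
print(r.url)

if "login" in r.url:
    print("Bounced to the login page - set a browser-like User-Agent")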

jches

So your script should present itself as a regular web browser. You can use a sniffer to analyze your requests to Facebook; Wireshark is well suited to this task.

Here is an example of what a request from Chrome looks like:

(screenshot: request headers sent by Chrome)

And here is an example of what a request from a Python script looks like:

>>> import urllib2
>>> opener = urllib2.build_opener()
>>> response = opener.open('http://facebook.com')

(screenshot: request headers sent by the urllib2 script)

As you can see, Facebook can easily recognize you as a bot (specifically, a Python bot). To look like a web browser, you have to add additional headers to your request.

In this question you can see how to check the default headers: Changing user agent on urllib2.urlopen
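
As a rough sketch of that approach with urllib2, you can set the headers on the opener itself (the User-Agent string below is the full Chrome one quoted in another answer on this page):

import urllib2

opener = urllib2.build_opener()
# Replace the default "Python-urllib/2.x" User-Agent with a browser-like one
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) '
                      'AppleWebKit/535.11 (KHTML, like Gecko) '
                      'Chrome/17.0.963.79 Safari/535.11')]
response = opener.open('http://www.facebook.com/cocacola/app_106795496113635')
print(response.read())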

Adam
  • @Adam - thanks, your explanation is right but the solution is overly complicated. We can modify headers (and do everything else) much more easily with the [requests](http://docs.python-requests.org/en/v0.10.7/index.html) library. – Yarin Mar 16 '12 at 03:34

Much easier is to use the Chrome developer tools (Shift-Control-J, or View -> Developer -> Developer Tools). Then go to the Network tab and press the record button (a black circle by default when not recording; it can be difficult to find at first). Then access Facebook, highlight your request of choice, and view the headers for that request in the sub-tabs. You're likely looking for something like

User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.79 Safari/535.11
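
Once you've copied that string, you can plug it into your request and compare what your script actually sends with what Chrome sends; a sketch (assuming a recent version of requests, where r.request.headers exposes the outgoing headers):

import requests

ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) "
      "AppleWebKit/535.11 (KHTML, like Gecko) "
      "Chrome/17.0.963.79 Safari/535.11")
r = requests.get("http://www.facebook.com/cocacola/app_106795496113635",
                 headers={"User-Agent": ua})

print(r.request.headers)   # what the script actually sent
print(r.status_code)
print(r.url)               # where you ended up
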
Kurt Spindler
  • not sure what to do with this- I know how to access the headers from my browser, but what's that going to tell me? – Yarin Mar 15 '12 at 20:25
  • It is used to fool web sites. If you add that header, the site won't recognize you as a bot. – Froyo Mar 15 '12 at 20:31
  • Ah, OK. So adding this header to my request allows me to emulate a browser. Got it. I'll try this and report back... – Yarin Mar 15 '12 at 21:40

Facebook doesn't allow bots. Maybe because you are just using requests, it won't allow you onto that page and sends you to some other page instead.

You should register your app with Facebook, get authorization using OAuth2, and then send those requests. It should work.
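
As a rough sketch of what that could look like once you have an access token (the token and the available fields depend on your app registration, the object type, and the API version, so treat this as illustrative):

import requests

ACCESS_TOKEN = "YOUR_APP_ACCESS_TOKEN"   # placeholder: obtained via Facebook's OAuth2 flow
app_id = "106795496113635"               # the id at the end of the URL in the question

r = requests.get("https://graph.facebook.com/" + app_id,
                 params={"access_token": ACCESS_TOKEN})
print(r.json())   # name, description, etc., depending on the object and API version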

Froyo