
So I'm trying to parse public Facebook pages using BeautifulSoup. I've managed to successfully scrape LinkedIn, but I've spent hours trying to get it to work on Facebook with no luck. The code I'm trying to use looks like this:

import urllib2
from bs4 import BeautifulSoup

for url in my_urls:
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page)
        info = soup.find_all("div", class_="fsl fwb fcb")
        # find_all returns a list of divs, so collect the links from each one
        info2 = [div.find_all('a') for div in info]
    except urllib2.URLError:
        continue

The part that's frustrating me is that I can get the title element out, and I can even get pretty far down the document, but I can't get to the part that I need.

This line successfully grabs the pageTitle:

info = soup.find_all("title", attrs={"id": "pageTitle"})

This line gets pretty far down the list of elements, but it can't go any farther:

info = soup.find_all(id="pagelet_timeline_main_column")

Here's a sample page that I'm trying to parse; I want the current city from it:

https://www.facebook.com/100004210542493

And here's a quick screenshot of the part I want:

http://prntscr.com/1t8xx6

I feel like I'm really close, but I just can't figure it out. Thanks in advance for any help!

EDIT 2: I should also mention that I can successfully print the whole soup and visually find the part I need, but for whatever reason the parsing just won't work the way it should.

cscanlin

1 Answer


Try looking at the content returned by curl or wget. What you are seeing in the browser is what has been rendered after the JavaScript has been executed.

wget https://www.facebook.com/100004210542493
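
A quick way to do the same check from Python (just a sketch, reusing the class string from the question) is to fetch the raw HTML with urllib2 and look for the markup you expect:

import urllib2

raw = urllib2.urlopen('https://www.facebook.com/100004210542493').read()
# If this prints False, the markup is most likely injected later by JavaScript
print('fsl fwb fcb' in raw)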

You might want to use mechanize or Selenium, since you want to simulate a client browser (instead of handling the raw content).
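
If you try the mechanize route, a minimal sketch could look like the following (it reuses the profile URL and class names from the question; note that mechanize does not execute JavaScript either, so client-side content may still be missing):

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)  # avoids the robots.txt 403, but check the site's policy before relying on this
br.addheaders = [('User-agent', 'Mozilla/5.0')]  # present a browser-like User-Agent

response = br.open('https://www.facebook.com/100004210542493')
soup = BeautifulSoup(response.read())
print(soup.find_all("div", class_="fsl fwb fcb"))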

Another related issue might be that Beautiful Soup cannot find a CSS class if the object has other classes, too.
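
For illustration, here is the difference with a hypothetical snippet standing in for the real markup (the exact-string search is the fragile one):

from bs4 import BeautifulSoup

html = '<div class="fcb fwb fsl"><a href="#">Current City</a></div>'  # hypothetical markup
soup = BeautifulSoup(html)

print(soup.find_all("div", class_="fsl fwb fcb"))  # exact string match: finds nothing because the class order differs
print(soup.find_all("div", class_="fsl"))          # matches any div whose classes include fsl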

surajz
  • Hey, thanks for taking the time to help me out. I'm sorry, I'm still learning and I'm not sure if I understand what you're saying. How do I strip out comment tags? A search led me here: http://stackoverflow.com/questions/3507283/how-can-i-strip-comment-tags-from-html-using-beautifulsoup but I can't seem to figure it out. – cscanlin Sep 24 '13 at 21:20
  • No, I have updated my answer. Look at the raw content instead of using the browser, or do response = urllib2.urlopen('https://www.facebook.com/100004210542493') and then response.read() to view the content. – surajz Sep 24 '13 at 21:21
  • Alright, so when I run that I get the following response: > I really haven't done enough of this to know if it's even reading the URL right now, or erroring out. – cscanlin Sep 24 '13 at 21:35
  • I'm also trying mechanize out, but keep on getting another error: "httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt" – cscanlin Sep 24 '13 at 21:40
  • Hey, just wanted to say thanks for the input earlier. I managed to get it using mechanize and re. Your comment helped me get on the right train of thought, though. Thank you! – cscanlin Sep 25 '13 at 00:16
  • @cscanlin how did you solve "HTTP Error 403:request disallowed by robots.txt" – sau May 11 '15 at 08:17
  • Truthfully, it's been so long I have no idea. A quick search seems to suggest that adding `br.set_handle_robots(False)` may be helpful. This seems to be against many websites' policies though, so use at your own risk: http://stackoverflow.com/a/3849843/1883900 – cscanlin May 12 '15 at 19:42