Web scraping with BeautifulSoup returns different content

Question

If you visit http://www.imdb.com/title/tt2375692/episodes?season=1, you will see that the air date of season 1, episode 1 is 25 Jan. 2014.

This is the code I am using to scrape.

    import urllib2
    from bs4 import BeautifulSoup

    req = urllib2.Request('http://www.imdb.com/title/tt2375692/episodes?season=1')
    self.diziPage = urllib2.urlopen(req).read()
    self.diziSoup = BeautifulSoup(self.diziPage, from_encoding="utf8")

After I scrape the page, everything is fine except the air date: episode 1's air date comes out as 20 April 2014, which is not what I see when I visit the page. All the rest of the information comes out correct.

I thought it might be because of the request headers, so I did some experiments, but that didn't help.
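For readers who want to repeat that kind of header experiment, here is a minimal sketch. It is written for Python 3, where `urllib2.Request` became `urllib.request.Request`. The header values are assumptions chosen for illustration, not the ones used in the original experiment, and nothing in this thread indicates that they change what IMDb returns.

```python
import urllib.request

url = "http://www.imdb.com/title/tt2375692/episodes?season=1"

# Assumed, illustrative header values; adjust them to mimic your browser.
headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept-Language": "en-US,en;q=0.8",
}

# Building the Request does not touch the network;
# the page is only fetched once urllib.request.urlopen(req) is called.
req = urllib.request.Request(url, headers=headers)

# Inspect the headers that would be sent with the request.
print(req.header_items())
```

Passing `req` to `urllib.request.urlopen` then performs the fetch with those headers, which makes it easy to compare the response against one fetched without them.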

Shows `20 April 2014` for me when I visit that page in my browser as well … — CBroe, Mar 16 '14 at 18:43

score 2 · Answer 1 · answered Mar 16 '14 at 20:00

I get 25 Jan. 2014 when I scrape the date using BeautifulSoup. First, find the link to the first episode (titled "I."), then get the episode block by taking the parent of the link's parent, then find the date by class inside it:

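    import urllib2
    from bs4 import BeautifulSoup

    url = "http://www.imdb.com/title/tt2375692/episodes?season=1"
    soup = BeautifulSoup(urllib2.urlopen(url))

    # The link titled 'I.' belongs to episode 1; its grandparent is the episode block.
    episode1 = soup.find('a', {'title': 'I.'}).parent.parent
    # The air date sits in a div with class 'airdate' inside that block.
    print episode1.find('div', {'class': 'airdate'}).text.strip()

prints:

    25 Jan. 2014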

alecxe

This is weird: when I run your script locally, everything is as expected. When I run it on my server, I get 20 Apr. 2014. Do you think IMDb serves content depending on the visitor's IP? And the second, weirdest thing: apart from episode 1, I get the correct air dates. Thanks – durdenk Mar 16 '14 at 20:50
@durdenk well, several things may have an influence. First of all, it is a mystery where `20 Apr. 2014` comes from: there is no such date in the source code of the page. It looks like a different URL is being parsed. – alecxe Mar 16 '14 at 21:38
I just copied and pasted your code, ran it locally and on my server, and got different outputs. It may be because of HTTP headers or the visitor's IP. It seems I need another website to parse air dates from, or something. – durdenk Mar 16 '14 at 22:14
By the way, 20 April comes from episode 1's release dates: Germany, 20 April 2014. My server is located in Germany. Even though I added the relevant content headers, I got different content only for episode 1, which is strange. – durdenk Mar 18 '14 at 15:12

score 0 · Accepted Answer · answered Jul 25 '15 at 21:02

It seems that IMDb provides different air dates according to the visitor's location. This is why I am getting different data; I think they check the visitor's IP address or something similar.
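One way to check this is to save the page as fetched from each machine and compare the extracted air dates. The following is only a sketch in Python 3 using the standard-library `html.parser`. The `airdate` class name comes from the answer above, while `AirdateParser`, `extract_airdates`, `differing_episodes` and the two sample HTML strings are hypothetical names and made-up, simplified markup for illustration.

```python
from html.parser import HTMLParser


class AirdateParser(HTMLParser):
    """Collect the text of every <div class="airdate"> element.

    Assumes the airdate div contains no nested div elements.
    """

    def __init__(self):
        super().__init__()
        self.airdates = []
        self._in_airdate = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "airdate" in classes:
            self._in_airdate = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_airdate:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._in_airdate:
            self.airdates.append("".join(self._buffer).strip())
            self._in_airdate = False


def extract_airdates(html):
    """Return the air dates found in the page, in document order."""
    parser = AirdateParser()
    parser.feed(html)
    parser.close()
    return parser.airdates


def differing_episodes(local_dates, server_dates):
    """Return (episode_number, local, server) for each mismatching pair."""
    pairs = enumerate(zip(local_dates, server_dates), start=1)
    return [(n, local, server) for n, (local, server) in pairs if local != server]


# Hypothetical, simplified markup standing in for the two saved pages.
local_html = '<div class="airdate"> 25 Jan. 2014 </div><div class="airdate">1 Feb. 2014</div>'
server_html = '<div class="airdate"> 20 Apr. 2014 </div><div class="airdate">1 Feb. 2014</div>'

print(differing_episodes(extract_airdates(local_html), extract_airdates(server_html)))
```

With the sample strings above, only episode 1 is reported as differing, which mirrors the behaviour described in the comments.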

durdenk
