This is my first attempt to use web scraping in python to extract some links from a webpage. This the webpage i am interested in getting some data from:
http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5
I am interest in extracting all the instance of following from above webpage:
href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352"
I have written following regex to extract all the matches of above type of links:
r"href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\""
Here is quick code i have written to try to extract all the regex mataches:
#!/usr/bin/python3
import re
import requests
url = "http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5"
page = requests.get(url)
l = re.findall(r'href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\"', page.text)
print(l)
When I run the above code I get following ouput:
./links2.py
[]
When I inspect the webpage using developer tools within the browser I can see this links but when I try to extract the text I am interested in(href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352") using python3 script I get no matches.
Am I downloading the webpage correctly, how do I make sure I am getting all of the webapage from within my script. i have a feeling I am missing parts of the web page when using the requests to get the web page.
Any help please.