This is my first attempt at web scraping in Python to extract some links from a webpage. This is the webpage I am interested in getting some data from:

http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5

I am interested in extracting all instances of links like the following from the above webpage:

href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352"

I have written the following regex to match all links of this type:

r"href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\""

Here is the quick script I wrote to try to extract all the regex matches:

  #!/usr/bin/python3
  import re
  import requests

  url = "http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5"

  # Download the page, then search the raw response text for episode links
  page = requests.get(url)
  links = re.findall(r'href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\"', page.text)
  print(links)

When I run the above code I get the following output:

./links2.py  
[]

When I inspect the webpage using the browser's developer tools, I can see these links, but when I try to extract the text I am interested in (href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352") with my Python 3 script, I get no matches.

Am I downloading the webpage correctly? How do I make sure I am getting the whole webpage from within my script? I have a feeling I am missing parts of the page when I use requests to fetch it.

Any help please.

mohan08p
frank
  • Great read: [How to debug small programs (#1)](https://ericlippert.com/2014/03/05/how-to-debug-small-programs/) – Patrick Artner Mar 09 '18 at 11:21
  • Read: [Why is it a bad idea to use regex for scraping HTML](https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not). – Keyur Potdar Mar 09 '18 at 12:12
  • The data you want is dynamically generated with Javascript. It is not available in the page source (which is what the `requests.get(...)` returns). – Keyur Potdar Mar 09 '18 at 12:14
  • Is there any way I can have the generated page saved as a text file on my PC and then try to extract the links? Is it possible to save it as a file using the browser? – frank Mar 09 '18 at 12:24
  • You would probably need to use `selenium` with headless Chrome. See for example: https://duo.com/decipher/driving-headless-chrome-with-python – mehdix Mar 09 '18 at 12:35
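
The suggestions in the comments above could be combined roughly as follows. This is only a sketch, not a tested solution for this site. The names `EpisodeLinkParser`, `extract_episode_links` and `fetch_rendered_html` are made up for illustration. The extraction part uses the standard-library `html.parser` instead of a regex over the whole page, and it works on any HTML string, including a page saved from the browser. The rendering part assumes `selenium`, Chrome and a matching driver are installed, and that the episode links are already in the DOM when `page_source` is read (a real page may need an explicit wait).

```python
import re
from html.parser import HTMLParser

# Shape of the episode links shown in the question (path values are taken from there).
EPISODE_HREF = re.compile(r"^/tv/bhojo-gobindo/14172/[^/]+/\d{10}$")


class EpisodeLinkParser(HTMLParser):
    """Collects href values of <a> tags that look like episode links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and EPISODE_HREF.match(value):
                self.links.append(value)


def extract_episode_links(html):
    """Return matching hrefs from an HTML string, de-duplicated, in page order."""
    parser = EpisodeLinkParser()
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


def fetch_rendered_html(url):
    """Hypothetical helper: load the page in headless Chrome and return the DOM.

    Assumes selenium and a Chrome driver are available. The import is kept
    local so the parsing code above still works without selenium installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the links exist by now; otherwise an explicit wait is needed.
        return driver.page_source
    finally:
        driver.quit()


# Small offline check with a hand-written snippet instead of the live site.
sample = (
    '<a href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352">ep</a>'
    '<a href="/tv/bhojo-gobindo/14172/seasons/season-5">season</a>'
)
print(extract_episode_links(sample))
# → ['/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352']
```

For a page saved from the browser, something like `extract_episode_links(open("saved_page.html", encoding="utf-8").read())` should work in the same way, where `saved_page.html` is a placeholder file name. With selenium available, `extract_episode_links(fetch_rendered_html(url))` would be the equivalent for the live page.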
