
I am fetching a web page like this:

import requests

html = requests.get("http://www.google.com/")

This returns a lot of markup I don't need in the `html` variable. I want only the text that is displayed in the web browser, without tags such as `head`, `link`, `meta`, and `script` or their contents. I tried doing this with the HTMLParser module, but it only strips the tags themselves. Any idea how I should achieve this?

Zaid Khan
  • The `html`, `head`, `link`, `meta`, `script`, etc. are part of the HTML that is displayed in the web browser, though. – AndrewL64 Feb 07 '17 at 20:36
  • As far as I know, they are not displayed in the web browser; they are there for animation or background purposes. By "displayed" I mean only the static output that the user sees. Everything is inside `html`, so leave `html`, but `link`, `meta`, `script`, etc. are junk for me. Correct me if I am wrong... – Zaid Khan Feb 07 '17 at 20:39
  • The static elements displayed in your browser depend on the above tags, Zaid (styling of the elements via the `link` tag for CSS, scripts via the `script` tag for JavaScript, and so on). – AndrewL64 Feb 07 '17 at 20:44
  • Yes, I totally agree with you, but I need to scrape just the text. I don't want any styling or JavaScript code. – Zaid Khan Feb 07 '17 at 20:48
  • Check this: http://stackoverflow.com/questions/11709079/parsing-html-using-python Just target the `body` instead of the `container` class in the answer. – AndrewL64 Feb 07 '17 at 20:56
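
For illustration, the idea discussed in the comments (keep the rendered text, drop `head`, `script`, and similar tags together with their contents) could be sketched with the standard library's `html.parser` alone. This is only a rough sketch under some assumptions: the names `VisibleTextExtractor`, `visible_text`, and `SKIPPED_TAGS` are made up for this example, the list of skipped tags is a guess at what counts as "not displayed", and the markup is assumed to be reasonably well formed (every skipped tag is closed).

```python
from html.parser import HTMLParser

# Assumed list of tags whose contents are not rendered as page text.
# This is a guess for illustration; extend or trim it as needed.
SKIPPED_TAGS = {"head", "script", "style", "noscript", "template"}


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside at least one skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every skipped tag.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html_source):
    """Return the text content of html_source, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.text()
```

With `requests`, the markup is in the response's `.text` attribute, so something like `visible_text(requests.get("http://www.google.com/").text)` would be the call. Text that a page builds with JavaScript after loading will not appear, since nothing here executes scripts.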

0 Answers