I managed to get the page source DOM of an external website, but it came with \r\n and lots of whitespace.

import urllib.request

request = urllib.request.Request('http://example.com')
response = urllib.request.urlopen(request)
page = response.read()
page = page.strip('\r\n')
print (page)

I tried stripping them, but no luck. How can I get just the HTML?

Secondly, what is the approach for manipulating the returned DOM with JavaScript/jQuery? I was hoping to do something like:

alert(document.getElementsByTagName('h1')[0].innerHTML);

Which should alert "Example Domain" with the generated DOM.

O P
  • "no luck" isn't helpful. What does `print (page)` output? – Andy Nov 06 '14 at 19:35
  • @Andy `TypeError: Type str doesn't support the buffer API` – O P Nov 06 '14 at 19:36
  • Not sure if you're aware of this or not, but `strip` removes characters only from the beginning or the end of a string. For example, `"\na\nb\n".strip("\n")` returns `'a\nb'`. – Kevin Nov 06 '14 at 19:39
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – ivan_pozdeev Nov 06 '14 at 20:27

1 Answer

'foo \r\n bar\r\n'.strip()

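will only remove whitespace at the start and end of the string (here, the trailing '\r\n'). If these characters appear throughout your text, try chaining .replace() like this (note that the last call removes every space, including the ones inside tags and between words):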

'foo \r\n bar\r\n'.replace('\r', '').replace('\n', '').replace(' ', '')
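The `TypeError` mentioned in the comments suggests that `response.read()` returned `bytes` rather than `str`, so the page probably needs decoding before any string method will work. Below is a minimal sketch of that idea, assuming the page is UTF-8 encoded. The helper names `decode_page`, `collapse_whitespace` and `first_tag_text` are hypothetical, and the `raw` value is a stand-in for the real response body. JavaScript will not run inside Python, so the `h1` lookup uses the standard library's `html.parser` as a rough equivalent of `getElementsByTagName('h1')[0].innerHTML` for plain-text content.

```python
import re
from html.parser import HTMLParser


def decode_page(raw: bytes, encoding: str = "utf-8") -> str:
    """Turn the bytes returned by response.read() into a str.

    The utf-8 default is an assumption; a real page may declare another charset.
    """
    return raw.decode(encoding, errors="replace")


def collapse_whitespace(html: str) -> str:
    """Replace every run of whitespace (CR, LF, tabs, spaces) with one space."""
    return re.sub(r"\s+", " ", html).strip()


class FirstTagText(HTMLParser):
    """Collect the text found inside the first occurrence of a given tag."""

    def __init__(self, tag: str):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.tag and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def first_tag_text(html: str, tag: str) -> str:
    """Return the text of the first <tag> element, or '' if there is none."""
    parser = FirstTagText(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Stand-in for response.read(); a real page would come from urlopen().
raw = b"<html>\r\n  <body>\r\n    <h1>Example Domain</h1>\r\n  </body>\r\n</html>\r\n"
page = collapse_whitespace(decode_page(raw))
print(page)
print(first_tag_text(page, "h1"))
```

Collapsing whitespace this way keeps single spaces between words and attributes, unlike `.replace(' ', '')`, but it also flattens the contents of `<pre>` blocks, so it may not suit every page.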
vikramls