7

How can I copy the source code of a website into a text file in Python 3?

EDIT: To clarify my issue, here's what I have:

import urllib.request

def extractHTML(url):
    f = open('temphtml.txt', 'w')
    page = urllib.request.urlopen(url)
    pagetext = page.read()
    f.write(pagetext)
    f.close()

extractHTML('http:www.google.com')

I get the following error for the f.write() function:

builtins.TypeError: must be str, not bytes
user1306802
  • Have you tried looking here?: http://stackoverflow.com/questions/5512811/builtins-typeerror-must-be-str-not-bytes – Jack Apr 01 '12 at 21:08
  • Surprisingly, none of the answers (except one) actually addressed the issue. `pagetext` is NOT a string; it's actually bytes. To convert it to a string, use `f.write(pagetext.decode('utf-8'))`, which writes the decoded text to the file. – Brandon Oct 12 '17 at 23:57
  • @Brandon I tried what you said and got an error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 8482: invalid start byte`. I literally copied my answer without the `str()` and put `f.write(pagetext.decode('utf-8'))` in place of `f.write(pagetext)`. Any idea why this is not working for me? If you are using Python 2, that might be why. – Xantium Oct 13 '17 at 00:40
  • Does this answer your question? [Save HTML of some website in a txt file with python](https://stackoverflow.com/questions/24297257/save-html-of-some-website-in-a-txt-file-with-python) – Gino Mempin Oct 24 '20 at 09:06

3 Answers

3
import urllib.request

site = urllib.request.urlopen('http://somesite.com')
data = site.read()  # read() returns bytes
with open("file.txt", "wb") as file:  # open file in binary mode
    file.write(data)  # write() takes the whole bytes object at once

Untested but should work.

EDIT: Updated for python3
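
To illustrate why the file mode matters here, below is a small sketch that needs no network access. The `data` value is a made-up stand-in for what `site.read()` returns, and the temporary file path is only for the demonstration.

```python
import os
import tempfile

# Hypothetical stand-in for the bytes returned by site.read()
data = b"<html><body>hello</body></html>"

# Throwaway location for the demonstration
path = os.path.join(tempfile.mkdtemp(), "file.txt")

# Text mode ("w") only accepts str, so writing bytes raises TypeError
try:
    with open(path, "w") as f:
        f.write(data)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

# Binary mode ("wb") takes the bytes unchanged
with open(path, "wb") as f:
    f.write(data)

# Read the file back to confirm the bytes round-trip exactly
with open(path, "rb") as f:
    saved = f.read()
```

After this runs, `text_mode_error` holds the TypeError from the text-mode attempt and `saved` equals `data`.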

Jack
  • Oops, sorry. What's the issue in python 3? – Jack Apr 01 '12 at 20:50
  • urllib2 doesn't exist, for starters. I think typically you'd use the urllib.request module (that's where urlopen now lives.) – DSM Apr 01 '12 at 20:51
  • Oops, seems this is redundant now that OP has updated their post. – Jack Apr 01 '12 at 21:07
  • I think you will have the same str/bytes problem. The HTTP response has bytes, but you've opened the file for writing str. The simplest way is just to open the file in binary mode (with `"wb"`). – Thomas K Apr 01 '12 at 22:35
  • Yes, this was posted before OP added what their problem was. – Jack Apr 02 '12 at 07:04
  • This works fine, thanks. Is there a way to store the source code as lines of strings? – user1306802 Apr 02 '12 at 17:39
  • I mean as opposed to an array of bytes (writing in binary mode), simply writing them to a file as a string. – user1306802 Apr 03 '12 at 14:00
  • Using wb gives me this error for http://notalwaysright.com/page/1: `TypeError: 'int' does not support the buffer interface` – rassa45 Jun 28 '15 at 16:10
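
On the two follow-up comments above, here is a rough sketch rather than a definitive fix. Iterating over a bytes object yields integers, which is likely where an `'int'` error comes from if the bytes are handed to `writelines()` instead of `write()`. To get the source as lines of strings, the bytes can be decoded first and then split. The helper name `html_bytes_to_lines` and the sample bytes are made up for illustration, and UTF-8 is only an assumed default.

```python
def html_bytes_to_lines(raw, encoding="utf-8"):
    """Decode raw page bytes and return the source as a list of str lines."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising
    text = raw.decode(encoding, errors="replace")
    return text.splitlines()


# Hypothetical stand-in for the bytes returned by site.read()
sample = b"<html>\n<body>hi</body>\n</html>\n"

# Iterating over bytes gives ints, not one-character strings
first_items = list(sample[:3])

# Decoding first gives ordinary strings, one per line
lines = html_bytes_to_lines(sample)
```

The resulting list can be written to a file opened in text mode, for example with `f.write("\n".join(lines))`.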
1

Try this.

import urllib.request
def extractHTML(url):
    urllib.request.urlretrieve(url, 'temphtml.txt')

This is easier, but if you still want to do it your way, here is the solution:

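import urllib.request

def extractHTML(url):
    with urllib.request.urlopen(url) as page:
        raw = page.read()  # bytes
        # use the charset from the Content-Type header, else assume UTF-8
        charset = page.headers.get_content_charset() or 'utf-8'
    pagetext = raw.decode(charset, errors='replace')
    with open('temphtml.txt', 'w', encoding='utf-8') as f:
        f.write(pagetext)

extractHTML('https://www.google.com')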

Your script raised an error because f.write() on a file opened in text mode needs a str, while page.read() returns bytes. The bytes have to be converted to a string before they are written.

Next I got an error saying no host was given. Google is served over HTTPS, so use https: rather than http:, and most importantly you forgot to include the // after the scheme (https://).
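
To see why the missing // leads to a "no host" error, the URL can be split with the standard library and inspected. This is only a quick check, not part of the fix, and the variable names are made up.

```python
from urllib.parse import urlsplit

# Without "//" everything after the scheme is parsed as a path, so the host is empty
broken = urlsplit("http:www.google.com")

# With "//" the host ends up in netloc, where urlopen expects it
fixed = urlsplit("https://www.google.com")

print("broken:", repr(broken.netloc), repr(broken.path))
print("fixed: ", repr(fixed.netloc), repr(fixed.path))
```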

Xantium
0

You probably wanted to create something like this:

import urllib.request

class ExtractHtml():

    def Page(self):
        print("enter the web page name starting with 'http://': ")
        url = input()
        site = urllib.request.urlopen(url)
        data = site.read()
        # binary mode, because read() returns bytes
        with open("D://python_projects/output.txt", "wb") as file:
            file.write(data)


w = ExtractHtml()
w.Page()
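
If the goal is a text file rather than raw bytes, one possible way to put the pieces from this thread together is sketched below. It is not the only way to do it. The function name `save_page_as_text`, the `fallback_encoding` parameter, and the sample file are made up for illustration. UTF-8 is only assumed when the response does not declare a charset. The try-out at the bottom uses a local file: URL so it runs without network access.

```python
import pathlib
import tempfile
import urllib.request


def save_page_as_text(url, out_path, fallback_encoding="utf-8"):
    """Fetch url, decode the bytes, and write the source to out_path as text."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # always bytes
        # Charset from the Content-Type header if declared, else the fallback
        charset = response.headers.get_content_charset() or fallback_encoding
    # errors="replace" avoids UnicodeDecodeError on stray bytes
    text = raw.decode(charset, errors="replace")
    # newline="" writes the page's own line endings without translation
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return text


# Offline try-out: urlopen also handles file: URLs, so a local file
# stands in for a website here (hypothetical sample content)
workdir = pathlib.Path(tempfile.mkdtemp())
source = workdir / "page.html"
source.write_bytes(b"<p>hello</p>\n")

saved_text = save_page_as_text(source.as_uri(), workdir / "temphtml.txt")
```

For a real site the call would look like `save_page_as_text('https://www.google.com', 'temphtml.txt')`.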