
I'm trying to download a text file using Python and sockets, but an error occurs when I decode the content I receive.

I'm using Python 3 on Windows and running test.py to fetch the content of http://linux.vbird.org/linux_basic/0330regularex/regular_express.txt

 python .\test.py linux.vbird.org 80 /linux_basic/0330regularex/regular_express.txt
# this file is named test.py
import socket
import sys

host = sys.argv[1]
port = sys.argv[2]
filename = sys.argv[3]
# creating a socket, using ipv4
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# connecting
s.connect((host, int(port)))
print("Connecting successful!\n")
str = "GET %s HTTP/1.0\r\n\r\n" % filename
s.sendall(str.encode('utf-8'))
while 1:
    try:
        buf = s.recv(2048)
    except socket.error as e:
        print("Error receiving data: %s" % e)
        sys.exit(1)
    if not len(buf):
        break
    sys.stdout.write(buf.decode('utf-8'))

I expected to get the content of the given URL, namely the content of the text file. However, I get the following error message:


Connecting successful!

Traceback (most recent call last):
  File ".\test.py", line 22, in <module>
    sys.stdout.write(buf.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 275: invalid start byte


CypherX
J.Butter
  • The text can be in an encoding other than `utf-8`, e.g. `latin1`, `cp1250`, etc. – furas Oct 25 '19 at 06:35
  • How do I figure out the file's encoding? Furthermore, what if I don't know the encoding used by the source URL? – J.Butter Oct 25 '19 at 07:18
  • Using chardet I can decode the received data correctly, but there is another problem. – J.Butter Oct 25 '19 at 08:19
  • The website tells me to use http://linux.vbird.org and "DO NOT USE http://www.vbird.org". Why does this message appear? The host I pass to my program really is linux.vbird.org, so this is confusing. Thanks. – J.Butter Oct 25 '19 at 08:21

2 Answers


HTTP headers are ASCII, or at most ISO-8859-1 (a single-byte encoding of characters such as "ü"). They are not UTF-8 (a multi-byte encoding of such characters). The HTTP body can be in any encoding, so it should be treated as bytes as long as its encoding is unknown.

The encoding can be given in the "charset" attribute in the Content-Type response header in case of text or HTML. It is not required though. In case of HTML it can also be given inside a meta tag. If it is not given the recipient might use defaults (which might not fit the actual encoding) or use heuristics to guess the encoding.
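A minimal sketch of that header lookup, assuming the whole response has already been read into one bytes object. The function names and the sample response are hypothetical and only serve as illustration; the charset parsing relies on the standard library's `email.message.Message.get_content_charset()`.

```python
from email.message import Message


def split_response(raw: bytes):
    """Split a raw HTTP response into (header text, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    # ISO-8859-1 maps every byte to a character, so this decode cannot fail
    return head.decode("iso-8859-1"), body


def charset_from_header(head: str, default=None):
    """Return the charset parameter of Content-Type, or default if absent."""
    for line in head.split("\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            msg = Message()
            msg["Content-Type"] = value.strip()
            return msg.get_content_charset() or default
    return default


# Hypothetical response, used only for illustration
raw = (b"HTTP/1.0 200 OK\r\n"
       b"Content-Type: text/plain; charset=iso-8859-1\r\n"
       b"\r\n"
       b"\xfc")
head, body = split_response(raw)
charset = charset_from_header(head)
# Keep the body as bytes when the server does not declare a charset
text = body.decode(charset) if charset else None
```

When no charset is declared, the sketch leaves the body as bytes instead of guessing, which matches the advice above.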

Steffen Ullrich

This started as an answer to the question in your comment about the "DO NOT USE vbird.org" message, but it ended up resolving the other problem too.


linux.vbird.org and vbird.org have the same IP. They are on one server.

The socket resolves linux.vbird.org to an IP address and connects to the server by that address, so the server doesn't know that you want a file from linux.vbird.org. It assumes you want the main domain, vbird.org; linux.vbird.org is only a subdomain of vbird.org.

You have to send the header Host: linux.vbird.org in the request to tell the server which subdomain you want the file from.

GET /linux_basic/0330regularex/regular_express.txt HTTP/1.0
Host: linux.vbird.org

With this header, the server sends your file.

I tested this header with your code and, as a side effect, it also resolves the encoding problem: your file is in UTF-8 and the server sends it as UTF-8, so buf.decode('utf-8') works without error. The earlier error presumably came from decoding the notice page rather than the file itself.


import socket
import sys

host = 'linux.vbird.org'
port = '80'
filename = '/linux_basic/0330regularex/regular_express.txt'

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((host, int(port)))
print("Connecting successful!\n")

# The Host header tells the server which (sub)domain is being requested
request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (filename, host)
print(request)

s.sendall(request.encode('utf-8'))
chunks = []
while True:
    try:
        buf = s.recv(2048)
    except socket.error as e:
        print("Error receiving data: %s" % e)
        sys.exit(1)
    if not buf:
        break
    chunks.append(buf)
s.close()

# Decode once at the end: a multi-byte character split across
# two recv() calls would otherwise raise UnicodeDecodeError
sys.stdout.write(b"".join(chunks).decode('utf-8'))
furas
  • After adding the host linux.vbird.org to the request string, I get the right result. One question: I found that the argument to sendall() must be **bytes**. What is the common way to handle this? In my code, I use encode('utf-8') to convert the string to bytes. Another question: buf is what I receive from the server, and I don't know its encoding. Should I first guess its encoding with a tool like **chardet** and then print it? – J.Butter Oct 28 '19 at 08:01
  • The standard method is to use `encode('utf-8')` to convert it to bytes. Currently most servers probably use `utf-8` as the default encoding for HTML, but sometimes you can meet older standards like `cp1250` or `iso-8859-1` on Windows servers. So you can use `try/except` with different encodings or eventually use chardet. But if you get a file from a server, you normally don't display it; you save it to disk without decoding. Open the file for writing in bytes mode, `open(..., 'wb')`, and you don't have to care about encoding problems. – furas Oct 28 '19 at 18:51
  • Sometimes the server also states in the response headers which encoding it used for the text/HTML data. – furas Oct 28 '19 at 18:58
  • If you use the `requests` module, you mostly don't have to care about encoding because it tries to recognize it: it can look in the response headers, in the HTML `<meta>` tag, etc. – furas Oct 28 '19 at 19:00
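The two suggestions from these comments, trying several encodings with `try/except` and saving the body in bytes mode, could be sketched roughly as below. The helper names, the candidate encoding list, and the sample response are assumptions made for illustration and are not taken from the thread.

```python
import os
import tempfile


def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1250")):
    """Try each candidate encoding in order; return (text, encoding used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 accepts every byte, so this last resort never raises,
    # although the resulting text may still be wrong
    return data.decode("iso-8859-1"), "iso-8859-1"


def save_body(raw: bytes, path: str) -> int:
    """Write the response body to path unchanged; return bytes written."""
    _, _, body = raw.partition(b"\r\n\r\n")
    with open(path, "wb") as f:  # binary mode: no decoding involved
        return f.write(body)


# Hypothetical response bytes, for illustration only
raw = b"HTTP/1.0 200 OK\r\n\r\ncaf\xe9"
text, used = decode_with_fallback(raw.partition(b"\r\n\r\n")[2])
path = os.path.join(tempfile.mkdtemp(), "body.bin")
written = save_body(raw, path)
```

A successful decode only shows that the bytes are valid in that encoding, not that the guess is right, so saving the raw bytes is the safer choice when the text does not need to be displayed.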