The following is my code to download a webpage (I am writing a basic wget):

HTTP request:

import socket

port = 80
# assume ip, website and path are already known
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((ip, port))
http_message = "GET /" + path + " HTTP/1.1\r\n"
http_message += "Host: " + website + "\r\n"
http_message += "Accept: text/html\r\n"
http_message += "Accept-Language: en-US,en;q=0.9\r\n"
http_message += "Accept-Encoding: gzip, deflate\r\n"
http_message += "User-Agent: Chrome/92.0.4515.131 Mozilla/5.0 (X11; Linux x86_64)\r\n"
http_message += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\n"
http_message += "Connection: keep-alive\r\n"
http_message += "\r\n"

x = sock.send(http_message.encode())

data = sock.recv(262144).decode("utf-8", "ignore")
print(data)

HTTP response on terminal:

HTTP/1.1 200 OK
Server: nginx/1.10.3
Date: Tue, 07 Sep 2021 18:52:00 GMT
Content-Type: text/html
Last-Modified: Thu, 31 Oct 2019 09:15:26 GMT
Transfer-Encoding: chunked
Connection: keep-alive
ETag: W/"5dbaa62e-8d0e"
Content-Encoding: gzip

272b
n~}.^~ןO?^kd[%+djj      \$s>t7H3o>6Osn>
          D˴?vuv{&oơ;-%Z8jYvNy]ӌ7q<n6Mݣ#g7-ocO
                                              g0M|8vc9ؒ̉mA5MkbH~zӊoE{-FS;۷!Q<iEs-Q4t

8�<O~Ֆ҉lo?`ǖ    6(~Lc7. *7ȳ#l6_#Ai?:m}';k278W9$/2~�dn?n-?/\uS>6]1..vM|:98a`0{^p@/=/)j=
r(=/zP}56HfMK36̍F{cMhI<)ؚ+68 u2
y/
  Ui
5iîBENöL߷mq]t|
:
 =0
   g;tĹ?q}  ЙJ
                  S\'0ZМ2ͦ,ޫOM:Ѳ90]LL('x'O1O?ߝ\3{>byx$ơsm݉.|H|G(41۩#?َ;{=
                                                                     ҺwGOG'O<T٣Guy9Q
_&?q-7GZJ-urD;scKo*&V]j:    9>"vYs
        -K.0"38үsl>v;Y0vv¨[U^UmQ`N[T
                                    HL6ݵ7(~a^3"w"?#80&B^]"7U~b
                                                              -wJyXx,�aymwo
                                                                           ^ݚBLJ=)JMܶS{/[z3&ØW&g(w)di͍bL S:R$B2֯m[|#XBm^Ei9b|,~D2G*.
                    p7+(ù!u.w ?>t+}M,x!7'Ό9|0/#S/1UbA5Fui0dPO#枽,s˄7 l5$OVin֟eAӋ:YPLsӔm}۩c
             ^sEh6S


mӽwG=X}ΤV*-جk70FI`!jxCr"ϐ+bUo
                             RE[/WR1k|%j    eBB(l3^H6cuP]PM-i[%h

The weird output continues....

The output above is in gzip format, which I am not able to decompress to a text file. I copied the weird output (everything except the HTTP response headers) from the terminal into output.txt.gz and used the gzip module:

import gzip
f=gzip.open('output.txt.gz','rb')
file_content=f.read()
print (file_content)

OUTPUT :

gzip.BadGzipFile: Not a gzipped file (b'27')

I can't find the exact format for gzip.

Also, if I don't decode the response

data = sock.recv(262144)

I get a huge binary response, which may help (screenshot: Binary Response Image).

  • "Have tried the gzip module, online convertor etc. nothing seem to work." - please show the code you've tried, and be specific in terms of what the result was. – Jon Skeet Sep 07 '21 at 19:05

1 Answer

You are not taking the HTTP response's Transfer-Encoding: chunked format into account.

In fact, you are not even taking the HTTP protocol itself into account. Your code is completely ignoring the fact that HTTP has structure and rules to how it works. You are just reading raw bytes from the socket, decoding everything to Unicode using UTF-8 (corrupting everything that should not be decoded), printing the Unicode to the terminal, and then copy/pasting that into a text file with a .gz extension. GZIP is not text.
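
To see why the decode step alone is already destructive, here is a small sketch. The HTML payload is made up for illustration; the decode call mirrors the one in the question, and the encode call stands in for pasting the printed text back into a file.

```python
import gzip

# Made-up payload, compressed the same way a server's gzip body would be.
original = gzip.compress(b"<html>hello</html>" * 50)

# What the question's code does: decode as UTF-8 and silently drop every
# byte sequence that is not valid UTF-8. Encoding again stands in for
# copy/pasting the printed text into output.txt.gz.
round_tripped = original.decode("utf-8", "ignore").encode("utf-8")

print(original[:2])               # gzip magic number, b'\x1f\x8b'
print(round_tripped == original)  # False: bytes were lost on the way
```

The second magic byte, 0x8b, is not valid UTF-8 on its own, so it is dropped immediately and the result can no longer be recognized as gzip.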

Not all of the HTTP response data is GZIP data, some of it is HTTP chunk data instead. You need to actually process the HTTP protocol correctly. Read ONLY the HTTP response headers first, then PARSE them to determine the format of the response body. IF the response is chunked, then READ AND PARSE the chunks properly, saving ONLY the data portion of each chunk into the output file, as-is as binary not text.
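
A minimal sketch of the "read the headers first, then parse them" step might look like the following. The helper name split_response is hypothetical, the sample response is made up, and the sketch assumes the raw bytes already contain the complete header block (a real client would keep calling recv() until it has seen the blank line).

```python
def split_response(raw: bytes):
    """Split a raw HTTP response into (status_line, headers, body_bytes)."""
    # The header block ends at the first blank line; the body stays as bytes.
    head, _, body = raw.partition(b"\r\n\r\n")
    # ISO-8859-1 maps every byte to a character, so this decode cannot fail.
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Made-up response with the same headers as the one in the question.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/html\r\n"
       b"Transfer-Encoding: chunked\r\n"
       b"Content-Encoding: gzip\r\n"
       b"\r\n"
       b"5\r\nhello\r\n0\r\n\r\n")

status_line, headers, body = split_response(raw)
is_chunked = "chunked" in headers.get("transfer-encoding", "").lower()
is_gzipped = headers.get("content-encoding", "").lower() == "gzip"
print(status_line, is_chunked, is_gzipped)
```

Only the header block is ever decoded to text here; the body is left untouched so that it can be de-chunked and decompressed as binary.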

See my answer to Differ between header and content of http server response (sockets) for more details about this.

For instance, the '27' that f.read() is complaining about is the first two characters of the initial 272b in the response, which specifies, in hexadecimal, the byte size (0x272b = 10027) of the GZIP data stored in the 1st chunk. You are saving the entire chunk to the .gz file, not just the GZIP portion of the chunk.
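
The arithmetic can be checked quickly. The size line below is the one from the question's output, written as the raw bytes that would come off the socket:

```python
# First line of the body shown in the question.
size_line = b"272b\r\n"

# Chunk sizes are hexadecimal; anything after ";" would be a chunk extension.
size_field = size_line.split(b";", 1)[0].strip()
chunk_size = int(size_field.decode("ascii"), 16)
print(chunk_size)  # 10027

# What gzip found at the start of output.txt.gz instead of b"\x1f\x8b":
print(size_line[:2])  # b'27'
```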

There can be more than 1 chunk present, so you have to read and parse each chunk individually. The response data will be terminated by a chunk with a size of 0 bytes.
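
As a rough sketch of that loop (not a full implementation), assuming the complete chunked body is already in memory as bytes and that any trailer headers after the last chunk can be ignored. The function name dechunk is hypothetical.

```python
def dechunk(body: bytes) -> bytes:
    """Return only the data portions of a chunked HTTP body."""
    out = bytearray()
    pos = 0
    while True:
        # Each chunk starts with a hex size line terminated by CRLF.
        line_end = body.index(b"\r\n", pos)
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break  # a zero-sized chunk terminates the body
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


# Made-up body with two data chunks followed by the terminating chunk.
sample = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
print(dechunk(sample))  # b'hello world'
```

When reading from a socket, the same steps apply, except that each size line and each chunk has to be read with a loop, because recv() may return fewer bytes than requested.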

See RFC 2616 Section 3.6.1 and RFC 7230 Section 4.1 for more details on how HTTP's chunked encoding works.

Remy Lebeau
  • Alright! I have read the documentation on gzip formatting and also parsed the HTTP response to check whether "Content-Encoding" is present or not. After googling a bit I found that if I parse up to the start of the gzip data and then use zlib.decompress, it solves the problem! Thanks for your help! – Aditya Verma Sep 08 '21 at 17:28
  • This issue has nothing to do with gzip formatting, and everything to do with your mishandling of the HTTP protocol. Had you been parsing the HTTP response properly and saving only the *relevant* pieces of data to your output `.gz` file, `gzip` would have been able to `read()` the file just fine. If you are going to decompress the response dynamically rather than using `gzip`, you still need to parse the HTTP chunks properly, you can't just decompress the entire response body as-is, you have to de-chunk it properly. – Remy Lebeau Sep 08 '21 at 18:13
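
To illustrate the point about decompressing dynamically, here is a small sketch using only the standard library. The payload and the three-way split are made up; they stand in for a server that spreads one gzip stream over several HTTP chunks.

```python
import gzip
import zlib

payload = b"<html><body>hello</body></html>" * 100
compressed = gzip.compress(payload)

# Assumption for illustration: the server split the gzip stream into three chunks.
third = len(compressed) // 3
chunk_datas = [compressed[:third], compressed[third:2 * third], compressed[2 * third:]]

# wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
html = b"".join(decompressor.decompress(d) for d in chunk_datas) + decompressor.flush()
print(html == payload)  # True: only the chunk data was fed to zlib

# The same data with its chunk framing left in place is not a gzip stream.
chunked_body = b"".join(b"%x\r\n%s\r\n" % (len(d), d) for d in chunk_datas) + b"0\r\n\r\n"
try:
    gzip.decompress(chunked_body)
    raw_ok = True
except (OSError, zlib.error):
    raw_ok = False
print(raw_ok)  # False: the body starts with a size line, not the gzip magic
```

Feeding the data portion of each chunk to one decompressobj in order gives the same result as joining the chunk data first and decompressing it in one call.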