3

This is my code:

import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send(b'GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')
while True:
    data = mysock.recv(1024)
    if ( len(data) < 1 ) :
        break
    print(data)
mysock.close()

and the output is:

b'HTTP/1.1 200 OK\r\n
  Date: Thu, 17 Mar 2016 01:45:49 GMT\r\n
  Server: Apache\r\nLast-Modified: Fri, 04 Dec 2015 19:05:04 GMT\r\n
  ETag: "e103c2f4-a7-526172f5b5d89"\r\n
  Accept-Ranges: bytes\r\n
  Content-Length: 167\r\n
  Cache-Control: max-age=604800, public\r\n
  Access-Control-Allow-Origin: *\r\n
  Access-Control-Allow-Headers: origin, x-requested-with, content-type\r\n
  Access-Control-Allow-Methods: GET\r\n
  Connection: close\r\n
  Content-Type: text/plain\r\n
  \r\n
  But soft what light through yonder window breaks\n
  It is the east and Juliet is the sun\n
  Arise fair sun and kill the envious moon\n
  Who is already sick and pale with grief\n'

however, I want it to be:

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

so what should I do?

zmo
Bs He
  • (I added the *real* line breaks to make the output readable; it was originally on a single line) – zmo Mar 17 '16 at 02:02
  • By the way, why are you sending raw HTTP requests? Perhaps you want to use a library like `urllib` or `requests` instead. – nneonneo Mar 17 '16 at 02:09

2 Answers

1

The data returned by the socket is a bytes object, so print() shows its raw representation, with escape sequences such as \r\n, on a single line. To print it as text, you need to decode it into a string. So, just replace:

print(data)

with

print(data.decode('utf-8'))

and the output will no longer be a single line, but a properly formatted string.

And, to extract the contents, you only need to do:

print(data.decode('utf-8').split('\r\n\r\n', 1)[1])

This gives you the content, because the HTTP standard specifies that the headers and the content are separated by an empty line, i.e. two consecutive carriage return, line feed pairs (\r\n\r\n).
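
One caveat: `recv` may deliver the response in several chunks, so decoding and splitting inside the loop can cut through the header separator or a multi-byte character. A minimal sketch that collects all the bytes first and splits once; the helper names `read_all` and `extract_body` are made up for illustration, and UTF-8 is assumed as the encoding:

```python
import socket


def read_all(sock):
    """Collect everything the peer sends until it closes the connection."""
    chunks = []
    while True:
        data = sock.recv(1024)
        if not data:  # empty bytes object: the peer closed the connection
            break
        chunks.append(data)
    return b''.join(chunks)


def extract_body(raw, encoding='utf-8'):
    """Return the text after the first blank line (CRLF CRLF) of a response."""
    headers, separator, body = raw.partition(b'\r\n\r\n')
    if not separator:
        raise ValueError('no header/body separator found')
    return body.decode(encoding)
```

With the socket from the question, `print(extract_body(read_all(mysock)))` would take the place of the whole `while` loop.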

zmo
  • This almost works except that there may be blank lines in the html so the split would take too little. `split('\r\n\r\n', 1)[1]` grabs it all. – tdelaney Mar 17 '16 at 02:26
  • Well, his example is a file with unix line endings, so the header part and the content part are easy to separate. But with html, you're right that can happen, and then what you said would do the job. – zmo Mar 17 '16 at 02:49
  • @zmo can you see why `.decode()` doesn't work when the raw page is a JPEG picture? – Bs He Mar 17 '16 at 20:09
  • Please do not recycle old questions for new issues. Please make a new question about that. – zmo Mar 17 '16 at 20:10
  • And about that, I already [answered that question today](http://stackoverflow.com/a/36066208/1290438) ☺ – zmo Mar 17 '16 at 20:13
  • @zmo Sorry but I can't see some insights there. The data retrieved from HTTP is Bytes string, right? But why `type(data[245])` returns `int`? – Bs He Mar 17 '16 at 21:01
  • That's because `data[245]` returns a single value from the bytes object, and each element of a bytes object is an integer. Please make a new question to have somebody else help you, as I'm going to bed now. Cheers! – zmo Mar 17 '16 at 21:03
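
A small sketch of what the last few comments describe, using made-up sample bytes rather than the actual response: indexing a `bytes` object gives an `int`, slicing gives `bytes`, and binary data such as the start of a JPEG file is generally not valid UTF-8, so it is better kept as bytes than decoded.

```python
# Made-up response: headers, blank line, then the first bytes of a JPEG (FF D8 ...)
data = b'HTTP/1.1 200 OK\r\n\r\n\xff\xd8\xff\xe0'

first = data[0]      # indexing yields an int (the byte value)
prefix = data[0:4]   # slicing yields a bytes object

print(type(first), first)
print(type(prefix), prefix)

payload = data.split(b'\r\n\r\n', 1)[1]
try:
    payload.decode('utf-8')
except UnicodeDecodeError as exc:
    # 0xff never occurs in valid UTF-8, so decoding fails
    print('not text:', exc.reason)
```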
0

The long answer is... This is HTTP so the header and payload are separated by an empty line. Find the first empty line and your data is next.

empty_line = b'\r\n\r\n'
index = data.index(empty_line)
payload = data[index + len(empty_line):]

Now you've got the right byte string, but it needs to be decoded into a string. Since the Content-Type header here doesn't name a charset, utf-8 is a reasonable choice.

text = payload.decode('utf-8')
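
When a server does name a charset in `Content-Type`, that can be used instead of guessing. A rough sketch built on the standard library's `http.client.parse_headers`; the function name `decode_payload` and the UTF-8 fallback are assumptions for illustration, and `raw` stands for the complete response bytes:

```python
import http.client
import io


def decode_payload(raw, fallback='utf-8'):
    """Split a raw HTTP response and decode the body using the header charset."""
    head, separator, payload = raw.partition(b'\r\n\r\n')
    if not separator:
        raise ValueError('no blank line between headers and payload')
    # Drop the status line; parse_headers expects only "Name: value" lines.
    status_line, _, header_block = head.partition(b'\r\n')
    message = http.client.parse_headers(io.BytesIO(header_block + b'\r\n\r\n'))
    # get_content_charset() returns None when Content-Type has no charset.
    charset = message.get_content_charset() or fallback
    return payload.decode(charset)
```

For the response in the question, which has `Content-Type: text/plain` and no charset, this falls back to UTF-8.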

The short answer is to use a tool such as requests to figure it out for you.

import requests
text = requests.get('http://www.py4inf.com/code/romeo.txt').text
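
If installing a third-party package is not an option, the standard library's `urllib.request` (the `urllib` suggestion from the comments on the question) can do a similar job. A sketch, with `fetch_text` as a made-up helper name and UTF-8 as an assumed fallback when the server names no charset:

```python
import urllib.request


def fetch_text(url, fallback='utf-8'):
    """Download a URL and decode the body to str."""
    with urllib.request.urlopen(url) as response:
        # response.headers is an email.message.Message; the charset may be None
        charset = response.headers.get_content_charset() or fallback
        return response.read().decode(charset)
```

It would be called as `text = fetch_text('http://www.py4inf.com/code/romeo.txt')`.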
tdelaney