I've been spending some time learning how to write HTTP requests by hand.

My goal is to request the HTML of a web page, then parse it and extract data from it.

I'm having trouble understanding how to do this when I don't have the exact path to a file and all I have is a base URL like www.google.com.

In a way, I'm trying to do what urllib.request does, but manually, with socket programming in Python.

#Playing with Sockets

import socket

target_port=80
target_url ='www.google.com'

client=socket.socket(socket.AF_INET,socket.SOCK_STREAM)

client.connect((target_url,target_port))


request= "GET https://www.google.com HTTP/1.1\nHost:google.com\n\n"

message= request.encode()
client.send(message)

response=client.recv(4096)
print(response.decode())

1 Answer

First of all, your HTTP request should use CRLF line endings, \r\n (hex values 0x0D and 0x0A). You're only using \n (0x0A). Here's a good Stack Overflow question on this.

Second, the path to the requested file is relative to the host address. When you call client.connect((target_url, target_port)), you are already connected to the host's HTTP server, so the request line only needs the path on that host, not the full URL. If all you have is a bare address like www.google.com, request the root path, /.
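To make the host/path split concrete, here is a minimal sketch using the standard library's urllib.parse.urlsplit. The helper name split_url is made up for illustration, and it assumes plain HTTP on port 80 when no port is given.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (host, port, path) for a plain-HTTP URL or a bare host name.

    Hypothetical helper for illustration; assumes port 80 by default.
    """
    # urlsplit only recognises the host part when '//' is present,
    # so add a scheme to bare names like 'www.google.com'.
    if "//" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    # An empty path means the root of the site.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return host, port, path
```

The host and port go to client.connect(), and the path goes into the request line.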

Ultimately, your request should look like this:

request = "GET /path/to/file.html HTTP/1.1\r\nHost: www.google.com\r\n\r\n"

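You will probably need some additional headers in there as well. For example, Connection: close asks the server to close the socket once the response is complete, which makes it easy to tell when you have read everything.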
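Putting those points together, a rough sketch of the whole exchange might look like the following. The function names build_get_request and fetch are my own, the Connection: close header is one possible choice of extra header, and it assumes the server speaks plain HTTP on port 80.

```python
import socket


def build_get_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes (illustrative helper)."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line uses the path, not the full URL
        f"Host: {host}",          # Host is mandatory in HTTP/1.1
        "Connection: close",      # ask the server to close when it is done
    ]
    # Every line ends with CRLF, and a blank line terminates the headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch(host, path="/", port=80):
    """Send the request and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=10) as client:
        client.sendall(build_get_request(host, path))
        chunks = []
        while True:
            # A single recv(4096) may return only part of the response,
            # so keep reading until the server closes the connection.
            chunk = client.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Calling fetch("www.google.com") should then return the status line, headers, and body as one bytes object. Note that sendall() is used instead of send() because send() is allowed to transmit only part of the buffer.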

Take a look here for more information. If that link doesn't take you to the correct section, I was talking particularly about the HTTP 1.1 Clients section. The Sample HTTP Exchange section is great also. Actually, you will probably find the whole page to be very useful.
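Since the end goal is to get at the HTML, here is a small sketch of how the raw bytes of a response could be split into status code, headers, and body. parse_response is a hypothetical helper, and it assumes the body is not sent with chunked transfer encoding or compression, which would need extra decoding.

```python
def parse_response(raw):
    """Split a raw HTTP response into (status_code, headers, body).

    Illustrative only: ignores chunked transfer encoding and compression.
    """
    # Headers and body are separated by the first blank line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line looks like: HTTP/1.1 200 OK
    version, status, reason = (lines[0].split(" ", 2) + ["", ""])[:3]

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalise to lower case.
        headers[name.strip().lower()] = value.strip()
    return int(status), headers, body
```

The body can then be decoded using the charset from the Content-Type header, if one is given, and handed to whatever HTML parser you prefer.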

mittmemo