HTTP request and regex in Python for HTML parsing

Question

When I execute the script, the result is empty. Why? The script connects to a site and parses the HTML <a> tags:

#!/usr/bin/python3

import re
import socket
import urllib, urllib.error
import http.client
import sys

conn = http.client.HTTPConnection('www.guardaserie.online');
headers = { "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Content-type": "application/x-www-form-urlencoded; charset=UTF-8" }
params = urllib.parse.urlencode({"s":"hannibal"})
conn.request('GET', '/',params, headers)
response = conn.getresponse();

site = re.search('<a href="(.*)" class="box-link-serie">', str(response.read()), re.M|re.I)
if(site):
  print(site.group())

Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Lex Scarisbrick, Aug 04 '16 at 18:14

score 1 · Accepted Answer · answered Aug 04 '16 at 18:13

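Most likely the pattern you are searching for does not occur in the response you read, or the regex breaks down at some point when applied to the HTML. Try loosening the pattern, for example by dropping the surrounding `<a ` and `>`: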

re.search( 'href="(.*)" class="box-link-serie"', str(response.read()), re.M | re.I )

A more generic pattern, or a different parsing method altogether, will likely get you the result you want.
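As a minimal sketch of what "more generic" could look like, the snippet below addresses two things that commonly go wrong with code like the question's: calling `str()` on a `bytes` object yields its `b'...'` representation rather than decoded text, and a greedy `(.*)` can run on to the last matching `class` attribute instead of stopping at the first. The sample markup is an assumption made up for illustration; only the `box-link-serie` class and the `ray-donovan-a` URL appear in this thread, and the real page may differ.

```python
import re

# Hypothetical sample standing in for response.read(); the real markup may differ.
raw = (
    b'<a href="http://www.guardaserie.online/ray-donovan-a/" class="box-link-serie">Ray Donovan</a>'
    b'<a href="http://www.guardaserie.online/hannibal/" class="box-link-serie">Hannibal</a>'
)

# Decode the bytes instead of calling str() on them.
html = raw.decode("utf-8", errors="replace")

# [^"]* stops at the closing quote, so each href is captured on its own.
pattern = re.compile(r'<a href="([^"]*)" class="box-link-serie">', re.I)
links = pattern.findall(html)

for link in links:
    print(link)
```

With the original greedy `(.*)`, the same input would produce a single match spanning both anchors, which may explain output that looks like a large chunk of the page.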

l'L'l


If you tried the pattern above it should return a result. I would recommend you try using these imports: `import re, httplib, socket, urllib, sys`, and change `params = urllib.urlencode`, as well as `conn = httplib.HTTPConnection` ... – l'L'l Aug 04 '16 at 18:30
The pattern returns the entire HTML page – faserx Aug 04 '16 at 18:33
the result is always that – faserx Aug 04 '16 at 18:39
I get `href="http://www.guardaserie.online/ray-donovan-a/" class="box-link-serie"` when using `print(site.group())` ... python code here : https://gist.github.com/anonymous/43026f7262b2fddfb7643169f0d558b2 – l'L'l Aug 04 '16 at 18:40
[See comment #4](http://stackoverflow.com/questions/38774213/http-request-and-regex-in-python/38774564?noredirect=1#comment64921304_38774564). – l'L'l Aug 04 '16 at 20:44
the result is always that – faserx Aug 05 '16 at 08:25
I solved the problem by using Beautiful Soup as the parser – faserx Aug 05 '16 at 21:05
I believe you mean the second problem was solved by using Beautiful Soup instead. The first problem, the blank output that the original question asks about, was solved by my answer and suggestions. – l'L'l Aug 06 '16 at 07:37
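
For readers who want the parser-based route mentioned in the comments, here is a rough sketch. The asker used Beautiful Soup; this example swaps in the standard library's `html.parser` so that it runs without extra packages. The class name `SerieLinkParser` and the sample markup are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class SerieLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class list includes box-link-serie."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Attribute values can be None (e.g. a bare attribute), so guard before splitting.
        classes = (attr_map.get("class") or "").split()
        if "box-link-serie" in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


# Hypothetical sample; attribute order differs from the regex in the question on purpose.
sample = (
    '<div><a class="box-link-serie" href="http://www.guardaserie.online/ray-donovan-a/">'
    'Ray Donovan</a><a href="/about">About</a></div>'
)

parser = SerieLinkParser()
parser.feed(sample)
print(parser.links)
```

Unlike the regex, this does not depend on the order of the `href` and `class` attributes or on the exact spacing inside the tag.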
