Python How to get a specific code in website using re

Question

I'm working on the Python Challenge: http://www.pythonchallenge.com/pc/def/ocr.html. I know I could just copy the text from the page source into a .txt file and work from that (I have already done it that way), but I want to fetch it from the web to improve my skills. I have tried:

re.findall(r"<!--(.*?)-->", html)

But it doesn't match anything. Here is my full code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall(r"<!--(.*)-->",str(x.content))
print codes

Also I tried making it like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall("<!--\n(.*)\n-->",str(x.content))
print codes

Now it finds the text but still can't get that mess :(

score 2 · Answer 1 · edited May 23 '17 at 11:52

I would use an HTML parser instead. You can find comments in HTML with BeautifulSoup.

Working code:

import requests
from bs4 import BeautifulSoup, Comment


link = "http://www.pythonchallenge.com/pc/def/ocr.html"
response = requests.get(link)

soup = BeautifulSoup(response.content, "html.parser")

code = soup.find_all(text=lambda text: isinstance(text, Comment))[-1]
print(code.strip())
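
If installing bs4 is not an option, the same parser-based idea can be sketched with the standard library's `html.parser` module (Python 3; the module has a different name in Python 2). This is a minimal sketch and not the answer's own method: the `CommentCollector` class, the `extract_comments` helper, and the sample page are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: collects the text of every HTML comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # HTMLParser calls this with the text between <!-- and -->.
        self.comments.append(data)


def extract_comments(html):
    """Return a list with the contents of all comments in the HTML string."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return collector.comments


# Made-up sample standing in for the real challenge page source.
sample_html = """<html><body>
<!--
first comment: a hint
-->

<!--
%%$@_$^__#)^
)^&a*(!q#@
-->
</body></html>"""

comments = extract_comments(sample_html)
mess = comments[-1].strip()  # the last comment holds the "mess"
print(mess)
```

With the real page, the string passed to `extract_comments` would presumably be `response.text` from the requests call above.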

answered Jul 07 '16 at 16:06

alecxe


I tried using bs4, but I found it too hard to control :( Also, I'm fairly new to Python, so I mostly can't understand lambdas. Can you write it as a function? – Dr. UK Jul 07 '16 at 16:08
@Dr.UK well, if you have HTML source to parse, you [should not be using regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) in general. As for your question, `lambda` is just a short way to write a function. You rarely need to pass functions to BeautifulSoup as in this case; BeautifulSoup is quite a convenient library that is easy to work with. Hope that helps. – alecxe Jul 07 '16 at 16:13
@Dr.UK, if you find bs4 too hard to control then good luck with nested elements and a regex. – Padraic Cunningham Jul 07 '16 at 16:40
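
To illustrate the point that `lambda` is just a short way to write a function, here is a small sketch comparing the lambda from the answer with an equivalent named function. So that it runs without bs4 installed, it uses a hypothetical stand-in `Comment` class (the real `bs4.Comment` is likewise a `str` subclass); the `is_comment` name and the `nodes` list are made up for illustration.

```python
class Comment(str):
    """Hypothetical stand-in for bs4's Comment class (also a str subclass)."""


def is_comment(text):
    # Named-function equivalent of: lambda text: isinstance(text, Comment)
    return isinstance(text, Comment)


as_lambda = lambda text: isinstance(text, Comment)

# Made-up mix of plain text nodes and comment nodes.
nodes = ["plain text", Comment(" a hint "), "more text", Comment(" the mess ")]

from_def = [n for n in nodes if is_comment(n)]
from_lambda = [n for n in nodes if as_lambda(n)]

print(from_def == from_lambda)   # both forms select the same nodes
print(from_def[-1].strip())      # the last comment, as in the answer's [-1]
```

With bs4 installed and `Comment` imported from `bs4` instead, a function like this could be passed in place of the lambda, as in `soup.find_all(text=is_comment)`.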

wpcarro · Accepted Answer · 2016-07-07T16:05:41.337

Not sure what you mean by "that mess". You should include all of the details of the challenge within this post, instead of linking users to the pythonchallenge post.

Either way, if you set the regex to single-line mode (the /s flag in many regex flavors), then the dot character, ., matches newlines, \n, as well. This obviates the \n(.*)\n construction in your regex, which may solve your problem.

Here's a link to a working regex example.

Here is the modified python 2.7 code:

#!/usr/bin/python
import requests, re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall("<!--(.*?)-->", str(x.content), re.S)
print codes[1]

Note the re.S, (.*?), and codes[1] modifications.

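re.S is Python's flag for /s (an alias for re.DOTALL); it makes the dot match newlines too
(.*?) makes the * quantifier non-greedy, so each match stops at the first --> it reaches
codes[1] prints the second set of content found within HTML comments (findall(..) finds 2 matches and returns a list of both sets).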
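
The effect of the re.S flag can be checked offline with a small sketch. The sample string below is made up to mimic the shape of the challenge page (two multi-line comments); it is not the real page source.

```python
import re

# Made-up sample with two multi-line comments, like the challenge page.
sample_html = """<html>
<!--
first comment: a hint
-->
<!--
%%$@_$^__#)^
)^&a*(!q#@
-->
</html>"""

pattern = r"<!--(.*?)-->"

# Without re.S the dot cannot cross a newline, so multi-line comments are missed.
without_flag = re.findall(pattern, sample_html)

# With re.S the dot also matches newlines, so both comments are captured.
with_flag = re.findall(pattern, sample_html, re.S)

print(len(without_flag))
print(len(with_flag))
print(with_flag[1].strip())  # the second comment, as in codes[1]
```

This also suggests why the first attempt in the question returned nothing: both comments start with a line break right after `<!--`.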

edited Jul 07 '16 at 16:05

answered Jul 07 '16 at 15:50

wpcarro


Please provide a python2.7 example. – Brian Jul 07 '16 at 15:55
On the website they call that block of text the "mess". Also, you don't need an account to visit the site. By the way, thanks for your answer. – Dr. UK Jul 07 '16 at 16:06
Is the answer working for what you're after? It seemed to work for me. – wpcarro Jul 07 '16 at 16:08
Also, I know it is bad to ask a question inside a question, but can you explain the (.*?) to me? Is it a pattern meaning "get everything here"? – Dr. UK Jul 07 '16 at 16:12
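
As a rough illustration of what (.*?) does compared with (.*), here is a short sketch on a made-up one-line string. The dot means "any character", * means "zero or more of them", and the trailing ? makes the match stop as early as possible instead of as late as possible.

```python
import re

# Made-up string with two comments on one line.
text = "<!--one--> middle <!--two-->"

# Greedy: .* runs to the LAST "-->" in the string, swallowing everything between.
greedy = re.findall(r"<!--(.*)-->", text)

# Non-greedy: .*? stops at the FIRST "-->" after each "<!--".
lazy = re.findall(r"<!--(.*?)-->", text)

print(greedy)  # ['one--> middle <!--two']
print(lazy)    # ['one', 'two']
```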

score 1 · Answer 3 · answered Jul 07 '16 at 15:56

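You can solve it with the re.S flag:

codes = re.findall(r"<!--(.*?)-->", str(x.content), re.S)

With re.S, the dot also matches line breaks, so the pattern can span several lines. Note that Python has no /pattern/s delimiter syntax: slashes and a trailing s written inside the pattern string would be matched literally, so the flag is passed as the third argument instead.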

answered Jul 07 '16 at 15:56

kollein

