0

I'm working on the Python Challenge: http://www.pythonchallenge.com/pc/def/ocr.html. I know I could just copy and paste the code from the page source into a text file and work from there (I have already done that), but I want to fetch it from the web to improve my skills. I have tried:

re.findall(r"<!--(.*?)-->", html)

But it doesn't return anything. Here is my full code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall(r"<!--(.*)-->",str(x.content))
print codes 

I also tried it like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall("<!--\n(.*)\n-->",str(x.content))
print codes 

Now it finds the text, but it still can't get that "mess" :(

Dr. UK

3 Answers

2

I would use an HTML parser instead. You can find comments in HTML with BeautifulSoup.

Working code:

import requests
from bs4 import BeautifulSoup, Comment


link = "http://www.pythonchallenge.com/pc/def/ocr.html"
response = requests.get(link)

soup = BeautifulSoup(response.content, "html.parser")

code = soup.find_all(text=lambda text: isinstance(text, Comment))[-1]
print(code.strip())
alecxe
  • I tried using bs4, but I found it too hard to control :( Also, I'm fairly new to Python, so I mostly can't understand lambdas. Can you write it as a function? – Dr. UK Jul 07 '16 at 16:08
  • @Dr.UK well, if you have HTML source to parse, you [should not be using regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) in general. As for your question, `lambda` is just a short way to write a function. You rarely need to pass functions to BeautifulSoup as in this case; BeautifulSoup is quite a convenient and easy library to work with. Hope that helps. – alecxe Jul 07 '16 at 16:13
  • @Dr.UK, if you find bs4 too hard to control then good luck with nested elements and a regex. – Padraic Cunningham Jul 07 '16 at 16:40
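For readers who, like the asker, would rather avoid the lambda: a minimal sketch of a parser-based alternative that uses only the standard library's html.parser instead of BeautifulSoup. This swaps in a different technique from the answer above and is written for Python 3. The names CommentCollector, extract_comments and sample_html are hypothetical, and the sample string only stands in for the real page, which could be fetched with requests.get(link).text.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment it sees."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # HTMLParser calls this once per <!-- ... --> block,
        # with the comment delimiters already stripped.
        self.comments.append(data)


def extract_comments(html):
    """Return a list with the contents of all comments in `html`."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return collector.comments


# Hypothetical stand-in for the page source (not the real challenge data).
sample_html = """<html><body>
<!--
find rare characters in the mess below:
-->
<!--
%%$@_$^__#)^
)^&*(a%%$#
-->
</body></html>"""

comments = extract_comments(sample_html)
mess = comments[-1].strip()  # the last comment holds the "mess"
print(mess)
```

As in the BeautifulSoup answer, the last comment is taken and stripped of surrounding whitespace; the parser handles multi-line comments without any regex flags.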
1

Not sure what you mean by "that mess". You should include all of the details of the challenge within this post, instead of linking users to the pythonchallenge post.

Either way, if you set the regex to single-line mode, /s, then the dot character, ., matches newlines, \n, as well. This obviates the \n(.*)\n construction in your regex, which may solve your problem.

Here's a link to a working regex example.

Here is the modified python 2.7 code:

#!/usr/bin/python
import requests, re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall("<!--(.*?)-->", str(x.content), re.S)
print codes[1]

Note the re.S, (.*?), and codes[1] modifications.

  • re.S is Python's flag for /s (single-line mode, also spelled re.DOTALL)
  • (.*?) makes the * quantifier non-greedy, so each match stops at the first --> it reaches
  • codes[1] prints the second set of content found within HTML comments (findall(..) matches two comments and returns a list of both).
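The effect of the re.S flag can be checked offline with a small sketch. The sample_html string below is a hypothetical stand-in for the page source (two multi-line comments), not the real challenge data, and the snippet is written for Python 3.

```python
import re

# Hypothetical stand-in for the page source, with two multi-line comments.
sample_html = "<html>\n<!--\nfirst comment\n-->\n<!--\nthe\nmess\n-->\n</html>"

pattern = r"<!--(.*?)-->"

# Without re.S, "." stops at newlines, so multi-line comments never match.
without_flag = re.findall(pattern, sample_html)

# With re.S (also spelled re.DOTALL), "." matches newlines too.
with_flag = re.findall(pattern, sample_html, re.S)

print(without_flag)          # []
print(len(with_flag))        # 2
print(with_flag[1].strip())  # the second comment, i.e. the "mess"
```

One caveat when porting the answer's code to Python 3: str(x.content) there produces the bytes repr (with newlines escaped as backslash-n text), so x.text is probably the better input for the regex.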
wpcarro
  • Please provide a python2.7 example. – Brian Jul 07 '16 at 15:55
  • On the website, they refer to that code as the "mess". Also, you don't need an account to visit that website. By the way, thanks for your answer. – Dr. UK Jul 07 '16 at 16:06
  • Is the answer working for what you're after? It seemed to work for me. – wpcarro Jul 07 '16 at 16:08
  • Also, I know it is bad to ask a question within a question, but can you explain (.*?) to me? Is it a pattern like "get everything here"? – Dr. UK Jul 07 '16 at 16:12
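On the (.*?) question in the last comment: the parentheses capture whatever is matched inside them, .* means "any characters, as many as possible", and the trailing ? makes it "as few as possible". A small sketch of the difference, using a made-up one-line sample string:

```python
import re

# Made-up sample with two comments on one line.
sample = "<!--one--> text <!--two-->"

# Greedy: .* runs to the LAST "-->", swallowing everything in between.
greedy = re.findall(r"<!--(.*)-->", sample)

# Non-greedy: .*? stops at the FIRST "-->" after each "<!--".
lazy = re.findall(r"<!--(.*?)-->", sample)

print(greedy)  # ['one--> text <!--two']
print(lazy)    # ['one', 'two']
```

With the greedy version, two comments collapse into a single match, which is why the answer switches to (.*?) before indexing codes[1].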
1

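You can solve it like this:

codes = re.findall("<!--(.*?)-->", str(x.content), re.S)

The s flag makes . match whitespace and line breaks as well. In Python it is passed as re.S rather than written as /.../s inside the pattern string, where the slashes would be matched literally.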

kollein