4

I've got to find the images in an HTML source. I'm using regex instead of html.parser because I know it better, but if you can explain HTML parsing to me like you would to a child, I'll be happy to go down that road too.

Can't use BeautifulSoup, wish I could, but I've got to learn to do this the hard way.

I've read through a lot of questions and answers on here on regex and html (example) so I'm aware of the feelings on this topic.

But hear me out!

Here's my coding attempt (Python 3):

import urllib.request
import re

website = urllib.request.urlopen('http://google.com')
html = website.read()
pat = re.compile (r'<img [^>]*src="([^"]+)')
img = pat.findall(html)

I double-checked my regex on regex101.com and it works at finding the img link, but when I run it in IDLE, I get a syntax error and it keeps highlighting the caret. Why?

I'm headed in the right direction... yes?

Update: Hi, I was thinking maybe I'd get a short, quick answer, but it seems I may have touched a nerve in the community.

I am definitely new and terrible at programming, no way around that. I've been reading all the comments and I really appreciate all the help and patience users have shown me.

Community
  • well in this context, it means more familiar. i only got thrown html parsing by my teacher and no working examples. got any leads to point me to? – pythonintraining Oct 20 '13 at 12:29
  • @user2799617 He's a newbie. Please try to be a bit more civil. From the looks of things you know nothing about proper netiquette. – Games Brainiac Oct 20 '13 at 12:29
  • 1
    You're getting a syntax error because... this is invalid syntax (hint: `re.compile` expects a string). But you should just take a look at the BeautifulSoup html parser, there's enough examples on here and elsewhere that should get you started. – l4mpi Oct 20 '13 at 12:31
  • I'm pretty confident that this is mostly correct code. The regex is giving me a headache, but I've been trying to teach myself html.parsing and that gave me a headache too. I don't know why that damn caret keeps getting called out... – pythonintraining Oct 20 '13 at 12:32
  • okay, here is a question...how do i do that? i downloaded it, but my computer looked at the .gz and went "da fuq??" so I figured I'm really close...seriously, i know nothing better. it's just me and idle and a badly written python text book. – pythonintraining Oct 20 '13 at 12:34
  • 3
    @user2799617 The person has asked a valid question, showed us what he's tried, and checked it on regex101 (which we need a link of). I highly doubt that he has done _anything_ wrong. – Games Brainiac Oct 20 '13 at 12:35
  • Persons parsing HTML with a regex are _always_ doing something wrong. –  Oct 20 '13 at 12:37
  • 1
    @pythonintraining For the gz issue, I guess you're using Windows. Install a utility like 7Zip. – nanofarad Oct 20 '13 at 12:37
  • http://regex101.com/r/nW1aO8 – pythonintraining Oct 20 '13 at 12:37
  • 2
    @user2799617 [Not always](http://stackoverflow.com/a/1733489/1424875). – nanofarad Oct 20 '13 at 12:37
  • 2
    hey user2799617, you don't need to ride me, i already ride myself hard enough. i thought the point of stackoverflow was to help people like me, go to reddit or craigslist if you want to keep on ranting. – pythonintraining Oct 20 '13 at 12:39
  • 1
    "That's mostly correct code" - except for the thing you're trying to pass to `re.compile`. Which should be a string. But isn't. Thus it's invalid syntax, as python doesn't know what it's supposed to be (it highlights the caret, because the partial expression before that could have been a valid term). That's about as patient as I can explain this... you know what a string is, right? @GamesBrainiac not doing research and basic debugging could still be considered doing something wrong. Just looking at examples of `re.compile` and how they differ from his code would have probably helped him... – l4mpi Oct 20 '13 at 12:40
  • Well to start off with, your string needs to be quoted `re.compile (r'<img [^>]*src="([^"]+)')` – Burhan Khalid Oct 20 '13 at 12:43
  • 1
    @l4mpi Perhaps he's using something as trivial as notepad++ or perhaps he doesn't understand where he's going wrong. Perhaps he's confused the `r` in front of a string with regex when it's actually raw. There are tons of places where a person can mess up. We _need_ to be considerate of newbies. – Games Brainiac Oct 20 '13 at 12:47
  • @GamesBrainiac First of all, "we" don't "need" to do anything. I agree there's no need to flame or harass someone who's new, but I strongly disagree with upvoting trivial, off-topic, duplicate or lazy questions just because the asker is new. This question is basically a mix between "find my typo" and "I don't understand this syntax error" (without even posting the error), with a meta-question of "I'm trying to shoot myself in the foot" - all of these are hardly on topic. IMO one should indicate the error in a comment, maybe downvote, and move on - but certainly not upvote 4 times. – l4mpi Oct 20 '13 at 12:58
  • @l4mpi Well, if you haven't figured it out, one person can only vote _once_, so some people might have agreed that downvoting the hell out of this question was a mistake. Secondly, this guy is such a newbie, that he figured out _yesterday_ that you can put a for loop in a for [loop](http://stackoverflow.com/questions/19449476/finding-and-printing-file-name-of-zero-length-files-in-python#comment28860154_19449627). He might not even know what on earth _documentation_ means. – Games Brainiac Oct 20 '13 at 13:06
  • @GamesBrainiac "He might not even know what on earth documentation means" then he should certainly be given some helpful comments and links to docs/tutorials, but _not_ be rewarded (as in: upvoted) for his ignorance. That's one hell of a slippery slope... – l4mpi Oct 20 '13 at 13:14
  • @l4mpi I agree with you. This should have stayed at a 0, or perhaps increased in up-votes upon improvement. And working on adding documentation. – Games Brainiac Oct 20 '13 at 13:15

3 Answers

3

There is nothing wrong with the regex, you are missing two things:

  1. Python has no regex literal type, so the pattern has to be written as a string. Use a raw string so that it is passed as-is to the regex compiler, without any escape interpretation.
  2. The result of the .read() call is a byte sequence (`bytes`), not a string, so you need a bytes pattern as well.

The second one is Python3-specific (and I see that you are using Py3)

Putting all together, just fix the aforementioned line like this:

pat = re.compile (rb'<img [^>]*src="([^"]+)')

r stands for raw and b for byte sequence.
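To make the second point concrete, here is a minimal sketch that runs without a network connection. The `html` value is a hypothetical stand-in for what `website.read()` returns, and decoding as UTF-8 is an assumption about the page's encoding; it shows the type mismatch and two ways around it.

```python
import re

# Hypothetical sample standing in for the bytes returned by website.read()
html = b'<p><img alt="logo" src="https://example.com/logo.png"></p>'

str_pat = re.compile(r'<img [^>]*src="([^"]+)')
bytes_pat = re.compile(rb'<img [^>]*src="([^"]+)')

# A str pattern applied to bytes raises TypeError in Python 3
try:
    str_pat.findall(html)
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

# Option 1: use a bytes pattern on the bytes
img_bytes = bytes_pat.findall(html)

# Option 2: decode first (assuming UTF-8), then use the str pattern
img_text = str_pat.findall(html.decode('utf-8'))

print(mixed_types_ok, img_bytes, img_text)
```

Option 1 returns `bytes` objects and option 2 returns ordinary strings, so which one fits depends on what you want to do with the URLs afterwards.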

Additionally, test on a website that actually embeds images in <img> tags, like http://stackoverflow.com. You will not find anything when processing http://google.com

Here we go:

Python 3.3.2+
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> import re
>>> website = urllib.request.urlopen('http://stackoverflow.com/')
>>> html = website.read()
>>> pat = re.compile (rb'<img [^>]*src="([^"]+)')
>>> img = pat.findall(html)
>>> img
[b'https://i.stack.imgur.com/tKsDb.png', b'https://i.stack.imgur.com/dmHl0.png', b'https://i.stack.imgur.com/dmHl0.png', b'https://i.stack.imgur.com/tKsDb.png', b'https://i.stack.imgur.com/6QN0y.png', b'https://i.stack.imgur.com/tKsDb.png', b'https://i.stack.imgur.com/L8rHf.png', b'https://i.stack.imgur.com/tKsDb.png', b'http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif']
Stefano Sanfilippo
1

Instead of urllib, I used requests, a third-party package you have to install separately. For this purpose they do the same thing; I just like requests better since it has a nicer API. The regex is only slightly changed: \s* is added to allow for white space between the < and img. You were headed in the right direction. You can find out more about the re module in its documentation.

Here is the code

import requests
import re

website = requests.get('http://stackoverflow.com/')
html = website.text
pat = re.compile(r'<\s*img [^>]*src="([^"]+)')
img = pat.findall(html)

print(img)  # parenthesized form works in both Python 2 and Python 3

And the output (run under Python 2, hence the `u` prefixes; Python 3 prints the same strings without them):

[u'https://i.stack.imgur.com/tKsDb.png', u'https://i.stack.imgur.com/L8rHf.png', u'https://i.stack.imgur.com/tKsDb.png', u'https://i.stack.imgur.com/Ryr18.png', u'https://i.stack.imgur.com/ASf0H.png', u'https://i.stack.imgur.com/tKsDb.png', u'https://i.stack.imgur.com/tKsDb.png', u'https://i.stack.imgur.com/tKsDb.png', u'https://i.stack.imgur.com/Ryr18.png', u'https://i.stack.imgur.com/VgvXl.png', u'https://i.stack.imgur.com/tKsDb.png', u'https://i.stack.imgur.com/tKsDb.png', u'https://i.stack.imgur.com/tKsDb.png', u'https://i.stack.imgur.com/tKsDb.png', u'https://i.stack.imgur.com/6QN0y.png', u'http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif']
Games Brainiac
  • I will add one suggestion. This answer is good. The question would have been valid without any code to retrieve a web page. In the future, it might be worthwhile to make a function that finds what you want from a string or array of bytes. Then the function has only a single concern, finding a list of images. – Fred Mitchell Oct 20 '13 at 13:34
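The single-concern function suggested in the comment above could look roughly like this. It is only a sketch: `find_image_sources` is a hypothetical name, the pattern is the one from this answer, and the sample markup is made up for illustration.

```python
import re

# The answer's pattern, compiled once for str input and once for bytes input
_IMG_SRC_STR = re.compile(r'<\s*img [^>]*src="([^"]+)')
_IMG_SRC_BYTES = re.compile(rb'<\s*img [^>]*src="([^"]+)')


def find_image_sources(html):
    """Return the src values of <img> tags found in html (str or bytes)."""
    if isinstance(html, (bytes, bytearray)):
        pattern = _IMG_SRC_BYTES
    else:
        pattern = _IMG_SRC_STR
    return pattern.findall(html)


# Made-up sample markup; no network access needed
sample = '<div><img class="a" src="one.png"><img src="two.gif"></div>'
print(find_image_sources(sample))
print(find_image_sources(sample.encode('ascii')))
```

Because the function only takes a string or bytes, it can be tested on small hand-written snippets and reused with whatever code fetches the page.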
0

re.compile (r'<img [^>]*src="([^"]+)')

You are missing the quotation marks (single or double) around the pattern.

mislavcimpersak
  • "and just to be sure it's good to escape quotation marks within the expresion" - what? That's more than wrong in this case... – l4mpi Oct 20 '13 at 12:41
  • agreed, but thanks for catching the missing quotation marks. now my error reads as: TypeError: can't use a string pattern on a bytes-like object – pythonintraining Oct 20 '13 at 12:45
  • it's a general remark regarding regex. in his case of parsing html he should catch both single and double quotation marks, but that is his job to do – mislavcimpersak Oct 20 '13 at 12:45
  • @mislav do you know what the `r` in front of the string means? "escaping" the quotation marks should only be done if they actually need to be escaped. Your regex matches `\"` instead of just the `"`. – l4mpi Oct 20 '13 at 12:49
  • i'm changing the answer to include only the remark about the missing quotes, so as not to derail someone in the future. worrying about quotes inside the regex for html is a whole new issue – mislavcimpersak Oct 20 '13 at 12:50
  • @l4mpi so you are saying that it's ok for him to write ```r"<img [^>]*src="([^0"]+)"```? – mislavcimpersak Oct 20 '13 at 12:57
  • @l4mpi since he needs to take care of both single and double quotes in html or else he will wind up with only partial results – mislavcimpersak Oct 20 '13 at 12:58
  • No, I'm saying YOUR REGEX IS WRONG because YOU are using `'` for your real string, meaning the inner `"` DOES NOT NEED TO BE ESCAPED and thus ESCAPING IT IS __WRONG__ because it DOES NOT MATCH THE CORRECT THINGS. Do I need to repeat this with even more capslock or bolding? – l4mpi Oct 20 '13 at 13:01
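On the single-versus-double-quote concern raised in these comments, and the question's request for an html.parser explanation: one way to avoid handling quote styles by hand is the standard library's `html.parser`. The following is a minimal sketch rather than a recommendation from any of the answers; `ImgSrcCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> start tag it is fed."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips the quotes
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


# Made-up sample with double-quoted, single-quoted and unquoted attributes
sample = """<p><img src="double.png"><IMG SRC='single.png' alt=x><img src=bare.gif></p>"""

collector = ImgSrcCollector()
collector.feed(sample)  # feed() expects str, so decode bytes first
print(collector.sources)
```

The parser calls `handle_starttag` once per opening tag and passes the attributes as a list of `(name, value)` pairs, so the class only has to pick out the tags and attributes it cares about.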