
I am working on a regex match function in Python. I have the following code:

def src_match(line, img):
    imgmatch = re.search(r'<img src="(?P<img>.*?)"', line)

    if imgmatch and imgmatch.groupdict()['img'] == img:
        print 'the match was:', imgmatch.groupdict()['img']

The above does not seem to work for me at all. On the other hand, I do have luck with this:

def href_match(line, url):
    hrefmatch = re.search(r'<a href="(?P<url>.*?)"', line)

    if hrefmatch and hrefmatch.groupdict()['url'] == url:
        print 'the match was:', hrefmatch.groupdict()['url']
    else:
        return None

Can someone please explain why this would be (or whether both should in fact work)? For example, is there something special about the identifier in the href_match() function? It can be assumed that in both functions I am passing in a line that contains the string I am searching for, along with the string itself.

EDIT: I should mention that I am sure I will never get a tag like:

<img width="200px" src="somefile.jpg"> 

The reason is that I am using a specific program to generate the HTML, and it will never yield a tag like that. This example should be taken as purely theoretical, on the assumption that I will always get a tag like:

<img src="somefile.jpg">

EDIT:

Here is an example of a line that I am feeding to the function and that fails to match the input argument:

<p class="p1"><img src="myfile.anotherword.png" alt="beat-divisions.tiff"></p>
jml
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Pepe Feb 06 '11 at 03:27
  • see my response below, which also applies to your typical (as of late) link. this is not helpful in the least and doesn't answer the question. there is certainly an answer to my problem that will help me learn. – jml Feb 06 '11 at 03:33
  • as per my answer below, the functions both work for me (Python 2.7.1 on Windows 7 in the interactive shell). Can you give a counter-example of input that should work but fails? – Hugh Bothwell Feb 06 '11 at 03:50
  • i put an example that fails above in an edit. thanks for taking a look. – jml Feb 06 '11 at 03:53

1 Answer


Rule #37: do not attempt to parse HTML with regex.

Use the right tool for the job - in this case, BeautifulSoup.
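
As a rough illustration of the parser-based approach: BeautifulSoup is a third-party package, so this sketch swaps in the standard library's html.parser instead (Python 3 module name, whereas the thread itself uses Python 2). The names ImgSrcCollector and img_sources are made up for this example and are not part of any library.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src value of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so the position of
        # src among the other attributes does not matter.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html):
    """Return all <img> src values found in the given HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources
```

Checking a line then becomes something like `'myfile.anotherword.png' in img_sources(line)`, with no assumption about attribute order.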

Edit:

Cut-and-pasting the function and testing it as

>>> src_match('this is <img src="my example" />','my example')
the match was: my example

so it appears to work; however, it will fail on (perfectly valid) HTML code like

<img width="200px" src="Y U NO C ME!!" />
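
If you do want to stay with a regex, one possible adjustment is to allow other attributes between `<img` and `src`. This is only a sketch under a few assumptions: double-quoted attribute values, no `>` inside an attribute value, and a tag that fits on one line. IMG_SRC and find_img_src are hypothetical names, not from the question.

```python
import re

# <img, then any run of non-'>' characters (other attributes),
# then whitespace followed by src="...". Requiring whitespace before
# src avoids matching attributes such as data-src.
IMG_SRC = re.compile(r'<img\b[^>]*?\ssrc="(?P<img>[^"]*)"')


def find_img_src(line):
    """Return the src value of the first <img> tag in line, or None."""
    match = IMG_SRC.search(line)
    return match.group('img') if match else None
```

It still is not an HTML parser; it just removes the dependency on src being the first attribute.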

Edit4:

>>> src_match('<p class="p1"><img src="myfile.png" alt="beat-divisions.tiff"></p>','myfile.png')
the match was: myfile.png
>>> src_match('<p class="p1"><img src="myfile.anotherword.png" alt="beat-divisions.tiff"</p>\n','myfile.anotherword.png')
the match was: myfile.anotherword.png

Still works; are you sure the url value you are trying to match against is correct?
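
One way to check that is to print both values with repr(), which exposes stray whitespace, trailing newlines or unexpected path prefixes. The sketch below reuses the regex from the question; src_match_debug is a hypothetical name. Note that the second argument is compared with ==, not used as a pattern, so periods in the filename are not treated specially.

```python
import re


def src_match_debug(line, img):
    """Show what the regex captured next to what was expected."""
    match = re.search(r'<img src="(?P<img>.*?)"', line)
    captured = match.group('img') if match else None
    # %r prints the repr, so '\n' or extra spaces become visible.
    print('captured: %r' % (captured,))
    print('expected: %r' % (img,))
    # Plain string comparison: '.' in img is just a character here.
    return captured == img
```

If the two printed values differ, the problem lies in how the expected value is built rather than in the regex.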

Hugh Bothwell
  • it's hilarious to me that i have to explain this each time i make a posting regarding this, but i'll say it again: i am not attempting to build an all-encompassing parser with this function. it's a tiny job, and it works in the other case. you will notice that i'm not attempting to parse anything but two distinct tags and i am wanting, more than anything, to learn more about regex in python in the process. – jml Feb 06 '11 at 03:18
  • thanks for your update. wouldn't that be ? – jml Feb 06 '11 at 03:36
  • FWIW, i know exactly what all of the tags will look like because i have a specific program generating the html. although your example is valid, it never yields html as such. – jml Feb 06 '11 at 03:37
  • i also updated my question to be more clear. thanks for your suggestions so far tho! – jml Feb 06 '11 at 03:42
  • hi again hugh: i did not include a pertinent difference: there are two periods in the string. what would i do in such a case? looking like more of a basic regex q now... you'll see my updated edit above. – jml Feb 06 '11 at 04:04
  • gah. i found the flippin' problem. i was constructing the list of files based on my directory tree, but the files weren't present on my hd in the directory where i was expecting them... so it never found the valid file. sorry for the noise. :( i really appreciate the help though. you helped me narrow it down quite a bit. – jml Feb 06 '11 at 04:26