Is there any way to make changes to the string in the capture group in re.sub?

Question

So I replace the link with the text of the link:

text = re.sub('<a href=\".*?\">(.*?)</a>','\\1',text)

example:

>>>text='<a href="SOME URL">SOME URL</a>'
>>>text = re.sub('<a href=\".*?\">(.*?)</a>','\\1',text)
>>>print text
SOME URL

I would like it to output some_url, but adding .lower().replace(' ','_') to the replacement string doesn't help:

>>>text = re.sub('<a href=\".*?\">(.*?)</a>','\\1'.lower().replace(' ','_'),text)
SOME URL

Adam Smith · Accepted Answer · 2015-05-06T17:55:03.703

2

Sure. re.sub accepts a callable for its repl argument. The docs make it pretty clear but here's an example:

import re

re.sub(r'<a href=".*?">(.*?)</a>',
       lambda match: match.group(1).lower().replace(' ', '_'),
       text)
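For completeness, here is a minimal, self-contained sketch of the same idea with a named function instead of a lambda. The names LINK_RE and slugify_link_text and the sample text are made up for illustration; they are not from the question. It also shows why the original attempt has no effect: the string methods run on the literal '\\1' before re.sub ever sees it, so the template comes out unchanged.

```python
import re

# Hypothetical names for illustration; the pattern is the one from the question.
LINK_RE = re.compile(r'<a href=".*?">(.*?)</a>')

def slugify_link_text(match):
    """Return the link text lowercased, with spaces turned into underscores."""
    return match.group(1).lower().replace(' ', '_')

# Why the original attempt fails: .lower() and .replace() are applied to the
# two-character template '\1' itself, which has no uppercase letters or spaces.
assert '\\1'.lower().replace(' ', '_') == '\\1'

# Assumed sample input with two links.
text = 'See <a href="SOME URL">SOME URL</a> and <a href="x">Other Page</a>.'

# re.sub calls the function once per match and inserts its return value.
result = LINK_RE.sub(slugify_link_text, text)
print(result)  # See some_url and other_page.
```

Only the captured link text is transformed; everything outside the matches is left as it was.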

edited May 06 '15 at 17:55

answered May 06 '15 at 17:44

Adam Smith

Close enough, but s is the match object, so you need to add .group(1): re.sub('(.*?)', lambda s: s.group(1).lower().replace(' ','_'), text). Please add it to your reply so I can mark it as the correct answer :) – KameeCoding May 06 '15 at 17:51
Oh so it is. That teaches me to link the docs and not *read* them! :) – Adam Smith May 06 '15 at 17:54
@Kameegaming that said, Davidos's suggestion to use a parsing library is generally best practice for trying to deal with X/HTML. [This famous answer](http://stackoverflow.com/a/1732454/3058609) documents one answerer's descent into madness trying to stop regex+HTML – Adam Smith May 06 '15 at 17:57
1

Thanks for the recommendation, but I am not actually dealing with HTML, only a parsed Wikipedia dump with a bunch of these "URLs" in it, so I am really just matching text rather than parsing HTML. – KameeCoding May 06 '15 at 18:20

enthus1ast · Answer 2 · 2015-05-06T18:04:17.923

1

For this kind of task I would consider a more mature package, e.g. Beautiful Soup:

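from bs4 import BeautifulSoup
# Naming the parser explicitly avoids the "no parser specified" warning in newer bs4 releases.
BeautifulSoup('<a href="SOME URL">SOME URL</a>', 'html.parser').find("a").text
# u'SOME URL'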

edited May 06 '15 at 18:04

answered May 06 '15 at 17:43

enthus1ast

agreed with the recommendation to use bs4, but as OP points out: he probably doesn't want to lowercase and un-space-ify the whole string. – Adam Smith May 06 '15 at 17:49
then he should be more specific. – enthus1ast May 06 '15 at 17:51
@DavidosKrausos the title is "Make changes to [the] string in the capture group." That seems pretty specific to me! :P – Adam Smith May 06 '15 at 17:55
This is the right answer when someone is dealing with pure HTML, I guess. In my case it wasn't the best solution, as I only deal with these URL markups and nothing else. – KameeCoding May 06 '15 at 18:32
