So I replace the link with the text of the link:

text = re.sub('<a href=\".*?\">(.*?)</a>','\\1',text)

example:

>>> text = '<a href="SOME URL">SOME URL</a>'
>>> text = re.sub('<a href=\".*?\">(.*?)</a>','\\1',text)
>>> print(text)
SOME URL

I would like it to output some_url

but adding .lower().replace(' ','_') doesn't help

>>> text = re.sub('<a href=\".*?\">(.*?)</a>','\\1'.lower().replace(' ','_'),text)
>>> print(text)
SOME URL
KameeCoding

2 Answers

Sure. re.sub accepts a callable for its repl argument. The docs make it pretty clear but here's an example:

import re

text = re.sub(r'<a href=".*?">(.*?)</a>',
              lambda match: match.group(1).lower().replace(' ', '_'),
              text)
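
As a self-contained sketch of the same idea (the sample string and the helper name slugify_link are made up for illustration, not part of the question), this also shows why the attempt in the question changes nothing: the .lower().replace(' ','_') chain is evaluated on the literal '\\1' before re.sub is ever called, so re.sub still receives a plain backreference.

```python
import re

# Hypothetical sample input, modelled on the string in the question.
text = '<a href="SOME URL">SOME URL</a> and some Other Text'


def slugify_link(match):
    # re.sub calls this once per match with a match object;
    # group(1) is the captured link text.
    return match.group(1).lower().replace(' ', '_')


result = re.sub(r'<a href=".*?">(.*?)</a>', slugify_link, text)
print(result)

# The question's attempt: the string methods run on the two characters
# backslash + "1", which contain no uppercase letters and no spaces,
# so the replacement template is left unchanged.
static_repl = '\\1'.lower().replace(' ', '_')
print(static_repl == '\\1')
```

Only the captured link text is transformed; anything outside the matched link is left as it was.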
Adam Smith
  • close enough, but s is the match object so need to add .group(1) re.sub('(.*?)', lambda s: s.group(1).lower().replace(' ','_'), text) please add it to your reply so I can mark it as correct answer :) – KameeCoding May 06 '15 at 17:51
  • Oh so it is. That teaches me to link the docs and not *read* them! :) – Adam Smith May 06 '15 at 17:54
  • @Kameegaming that said, Davidos's suggestion to use a parsing library is generally best practice for trying to deal with X/HTML. [This famous answer](http://stackoverflow.com/a/1732454/3058609) documents one answerer's descent into madness trying to stop regex+HTML – Adam Smith May 06 '15 at 17:57
  • Thanks for the recommendation but I am not actually dealing with HTML, only a parsed wikipedia dump with a bunch of these "URLs" in them so I am really just text matching rather than parsing html – KameeCoding May 06 '15 at 18:20

For this kind of task I would consider a more mature package, e.g. Beautiful Soup:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<a href="SOME URL">SOME URL</a>', 'html.parser').find("a").text
u'SOME URL'
enthus1ast
  • agreed with the recommendation to use bs4, but as OP points out: he probably doesn't want to lowercase and un-space-ify the whole string. – Adam Smith May 06 '15 at 17:49
  • then he should be more specific. – enthus1ast May 06 '15 at 17:51
  • @DavidosKrausos the title is "Make changes to [the] string in the capture group." That seems pretty specific to me! :P – Adam Smith May 06 '15 at 17:55
  • This is the right answer when someone is dealing with pure HTML I guess, for my case this wasn't the best solution as I only deal with these url markups and nothing else. – KameeCoding May 06 '15 at 18:32