I am using Python to scrape some info off IMDb and want to replace a given piece of text, plus the 12 characters that follow it, with an empty string. Is this possible? Here is an example:

I have the string

'<a href="/name/nm2142796/">Santino Rice</a> tells Roxxxy Andrews that she was "like Chewbaca in drag."'

And would like to replace the '<a href="/name/nm2142796/">' with '', but is there a way to do something like:

string.replace('<a href="/name/'+12,'')

It comes up quite a bit, but the nm####### is always different (it is always 7 digits following the nm, though).

rjbogz
  • What about the `</a>`? – Tim Pietzcker Jul 13 '13 at 20:17
  • yea, that's easy though `string.replace('</a>','')` haha – rjbogz Jul 13 '13 at 20:18
  • Did you read the [Conditions of Use](http://www.imdb.com/help/show_article?conditions): **Robots and Screen Scraping**: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below. – Tim Pietzcker Jul 13 '13 at 20:19
  • ...especially since there's an [API](http://stackoverflow.com/questions/1966503/does-imdb-provide-an-api)... – Tim Pietzcker Jul 13 '13 at 20:21

2 Answers

This is strictly what you're asking for:

import re
re.sub(r'<a href="/name/.{12}', '', string)

Replaces the string and the 12 characters that follow it (nm, the 7 digits, and the closing /">).

re.sub('<a href="/name/[^>]*>',  '', string)

would also work, without relying on a number of characters.
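As a quick check of both patterns, here is a minimal sketch run against the string from the question. The variable names (`text`, `fixed`, `flexible`, `cleaned`) are just illustrative, and the final `replace` for the closing tag is an optional extra step rather than part of the original request.

```python
import re

text = ('<a href="/name/nm2142796/">Santino Rice</a> tells Roxxxy Andrews '
        'that she was "like Chewbaca in drag."')

# Fixed width: the prefix plus exactly 12 more characters (nm + 7 digits + /">).
fixed = re.sub(r'<a href="/name/.{12}', '', text)

# Variable width: everything up to and including the tag's closing '>'.
flexible = re.sub(r'<a href="/name/[^>]*>', '', text)

# Optionally drop the closing tag as well.
cleaned = flexible.replace('</a>', '')

print(cleaned)
```

Both patterns give the same result here; the second one keeps working if the ID ever has a different length.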

But of course it would be better to use an HTML parser instead of trying to clean HTML using string manipulation. BeautifulSoup, for example, or lxml, HTMLParser... pick one.
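To give an idea of the parser route without installing anything, here is a rough sketch using the standard library's `html.parser` module, assuming Python 3. The `TextExtractor` class and `strip_tags` helper are made-up names for illustration; BeautifulSoup or lxml would be the more usual choice for real scraping work.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes, discarding every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


snippet = ('<a href="/name/nm2142796/">Santino Rice</a> tells Roxxxy Andrews '
           'that she was "like Chewbaca in drag."')
print(strip_tags(snippet))
```

This removes both the opening and the closing tag in one pass and does not depend on the ID length or the exact attribute layout.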

mata

If you want to keep regex out of it, and the link always sits at the very start of the string and appears only once, you could do something like this:

string.replace('<a href="/name/','')[12:]
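The slice in that one-liner only lines up when the link opens the string. If it can appear anywhere, or more than once, a regex-free version might look like the sketch below. `strip_name_links` is a hypothetical helper, and the default of 12 extra characters assumes the `nm` + 7 digits + `/">` layout described in the question.

```python
def strip_name_links(text, prefix='<a href="/name/', extra=12):
    """Remove every occurrence of `prefix` plus the `extra` characters after it."""
    out = []
    pos = 0
    while True:
        start = text.find(prefix, pos)
        if start == -1:
            # No more links: keep the rest of the string and stop.
            out.append(text[pos:])
            break
        # Keep the text before the link, then skip the prefix and the ID part.
        out.append(text[pos:start])
        pos = start + len(prefix) + extra
    return ''.join(out)


sample = ('See <a href="/name/nm2142796/">Santino Rice</a> and '
          '<a href="/name/nm0000001/">someone else</a>')
print(strip_name_links(sample))
```

The closing `</a>` tags are left in place here; a plain `.replace('</a>', '')` afterwards takes care of those.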

Or you could replace using a regex:

import re
re.sub(r'<a href="/name/nm[\d]+/">', '', string)
mjacksonw