python string.replace() a string + 12 characters following

Question

I am using python to scrape some info off IMDb and am looking to replace a given text + 12 characters that follow it with a blank. Is this possible? Here is an example:

I have the string

'<a href="/name/nm2142796/">Santino Rice</a> tells Roxxxy Andrews that she was "like Chewbaca in drag."'

And would like to replace the '<a href="/name/nm2142796/">' with '', but is there a way to do something like:

string.replace('<a href="/name/'+12,'')

it comes up quite a bit, but the nm####### is always different (it is always 7 digits following the nm though).

Did you read the [Conditions of Use](http://www.imdb.com/help/show_article?conditions): **Robots and Screen Scraping**: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below. — Tim Pietzcker, Jul 13 '13 at 20:19
...especially since there's an [API](http://stackoverflow.com/questions/1966503/does-imdb-provide-an-api)... — Tim Pietzcker, Jul 13 '13 at 20:21

mata · Accepted Answer · 2013-07-13T20:25:26.420

3

This is strictly what your're asking for:

import re
re.sub('<a href="/name/.{9}', '', string)

Replaces the string and 9 more characters.

re.sub('<a href="/name/[^>]*>',  '', string)

would also work, without relying on a number of characters.

But of course it would be better to use a html parse istead of trying to clean html using string manipulation. BeautifulSoup for example, or lxml, htmlparser... pick one.

edited Jul 13 '13 at 20:25

answered Jul 13 '13 at 20:19

mata

67,110
10
163
162

ahh of course, regex. Didn't even pop into my mind for some reason. Thanks! – rjbogz Jul 13 '13 at 20:20
would be `{12}` though to include the extra `/">` – rjbogz Jul 13 '13 at 20:21
ahh, no way to do it more than once? – rjbogz Jul 13 '13 at 20:27

score 1 · Answer 2 · answered Jul 13 '13 at 20:22

1

If you want to keep regex out of it, you could do something like this:

string.replace('<a href="/name/','')[12:]

Or you could replace using a regex:

import re
re.sub(r'<a href="/name/nm[\d]+/">', '', string)

answered Jul 13 '13 at 20:22

mjacksonw

491
4
8

well, regex isn't too bothersome, but how does `string.replace(' – rjbogz Jul 13 '13 at 20:26
Also, see my comment on the answer by mata, is there a way to have regex replace all occurrences? – rjbogz Jul 13 '13 at 20:28

python string.replace() a string + 12 characters following

2 Answers2