I asked this question before, but it was closed after commenters pointed me to other posts. Those posts didn't specifically use str.replace, which is what I'm supposed to use. Their approaches may have worked, but I still don't understand how to do it with str.replace.

This is the question: [image not available]

This is trump['source']:

trump['source'] = array(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
       '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
       '<a href="https://studio.twitter.com" rel="nofollow">Media Studio</a>',
       '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>',
       '<a href="http://instagram.com" rel="nofollow">Instagram</a>',
       '<a href="https://mobile.twitter.com" rel="nofollow">Mobile Web (M5)</a>',
       '<a href="https://ads.twitter.com" rel="nofollow">Twitter Ads</a>',
       '<a href="https://periscope.tv" rel="nofollow">Periscope</a>',
       '<a href="https://studio.twitter.com" rel="nofollow">Twitter Media Studio</a>'],
      dtype=object)

I'm fairly new to regex, so I don't really know what to do. However, this is what I have right now:

r = r'>(.*)<'

This captures the text I want in a group.

Does anyone know how to use str.replace with regex to get rid of the tags and replace each value with just "Twitter for iPhone", etc.?

Thanks

user3085496

1 Answer

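You can use the following. Note that regex=True is required on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default:

trump['source'].str.replace(r'^<a.*nofollow">(.*)<\/a>$', r'\1', regex=True)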

Output:

0      Twitter for iPhone
1     Twitter for Android
2      Twitter Web Client
3            Media Studio
4        Twitter for iPad
5               Instagram
6         Mobile Web (M5)
7             Twitter Ads
8               Periscope
9    Twitter Media Studio
Name: source, dtype: object

Notes:

This is a quick, non-robust regex pattern (it assumes every value is a single anchor tag whose last attribute is rel="nofollow"), but it suits the task at hand. Here are the specifics of the pattern:

  • ^ Anchor the match at the start of the string
  • <a Match the opening of the anchor tag
  • .* Match 0+ occurrences of any character (here, the tag's attributes)
  • nofollow"> Match the end of the rel="nofollow" attribute and the tag's closing >, which marks where capturing starts
  • (.*) Capture 0+ occurrences of any character (i.e. our subject text)
  • <\/a> Match the closing anchor tag, which ends the capture (the backslash before / is optional in Python)
  • $ Anchor the match at the end of the string
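
To see what each piece of the pattern does outside of pandas, here is a minimal sketch using Python's built-in re module on one value copied from the question. The variable names (source, pattern, label, replaced) are only illustrative.

```python
import re

# One sample value copied from the question's data.
source = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

# The same pattern as in the answer, compiled once for reuse.
pattern = re.compile(r'^<a.*nofollow">(.*)<\/a>$')

# group(1) is whatever the (.*) between nofollow"> and </a> captured.
match = pattern.match(source)
label = match.group(1)

# Because ^...$ makes the pattern cover the whole string, substituting
# the match with \1 leaves only the captured text.
replaced = pattern.sub(r'\1', source)

print(label)
print(replaced)
```

Both print statements should show the bare label, since the match spans the entire string and group 1 is the only part kept.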

The second parameter of the replace() function is the replacement text. Because the pattern matches the whole string, replacing the match with the first capture group, written r'\1', leaves only the captured text. You can also write it as '\\1' in a non-raw string.
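
As a self-contained sketch of the above, the snippet below builds a small Series as a stand-in for trump['source'] (three values copied from the question; the Series itself is an assumption for illustration) and applies str.replace with both spellings of the backreference. It also shows one way the asker's own '>(.*)<' idea could be widened to cover the whole string, which is needed because str.replace only replaces the part that matched.

```python
import pandas as pd

# Stand-in for trump['source'], using three values from the question.
source = pd.Series(
    [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://instagram.com" rel="nofollow">Instagram</a>',
        '<a href="https://mobile.twitter.com" rel="nofollow">Mobile Web (M5)</a>',
    ],
    name='source',
)

pattern = r'^<a.*nofollow">(.*)<\/a>$'

# Raw-string backreference to the first capture group.
cleaned = source.str.replace(pattern, r'\1', regex=True)

# Equivalent non-raw spelling: '\\1' is the same two characters, \ and 1.
cleaned_alt = source.str.replace(pattern, '\\1', regex=True)

# The asker's '>(.*)<' widened with leading and trailing .* so the whole
# string is matched and everything outside the group is discarded.
cleaned_own = source.str.replace(r'^.*>(.*)<.*$', r'\1', regex=True)

print(cleaned.tolist())
```

Assigning the result back, for example trump['source'] = trump['source'].str.replace(...), would store the cleaned labels, since str.replace returns a new Series rather than modifying the column in place.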

Here is a link to a site I like to use to test regex patterns.

S3DEV