Question on using string.replace and regex to clean data

Question

I asked this question before, but it was closed because commenters pointed me to other posts. Those posts didn't specifically use str.replace, which is what I'm supposed to use. Their approaches may work, but I still don't understand how to do it with str.replace.

This is the question:

This is trump['source']:

trump['source'] = array(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
       '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
       '<a href="https://studio.twitter.com" rel="nofollow">Media Studio</a>',
       '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>',
       '<a href="http://instagram.com" rel="nofollow">Instagram</a>',
       '<a href="https://mobile.twitter.com" rel="nofollow">Mobile Web (M5)</a>',
       '<a href="https://ads.twitter.com" rel="nofollow">Twitter Ads</a>',
       '<a href="https://periscope.tv" rel="nofollow">Periscope</a>',
       '<a href="https://studio.twitter.com" rel="nofollow">Twitter Media Studio</a>'],
      dtype=object)

I'm fairly new to regex, so I don't really know what to do. This is what I have right now:

r = r'>(.*)<'

as this captures what I want and groups it.

Does anyone know how to use str.replace with regex to get rid of the tags and replace each value with just "Twitter for iPhone" etc.?

Thanks

S3DEV · Accepted Answer · 2020-09-24T18:36:54.663

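You can use the following (regex=True is passed explicitly, because pandas 2.0 and later treat the pattern as a literal string by default):

trump['source'].str.replace(r'^<a.*nofollow">(.*)<\/a>$', r'\1', regex=True)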

Output:

0      Twitter for iPhone
1     Twitter for Android
2      Twitter Web Client
3            Media Studio
4        Twitter for iPad
5               Instagram
6         Mobile Web (M5)
7             Twitter Ads
8               Periscope
9    Twitter Media Studio
Name: source, dtype: object
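
For a version that can be pasted and run on its own, here is a minimal sketch. It assumes a pandas version that accepts the regex keyword, and it rebuilds only three of the values from the question as a stand-in Series named source (the real data lives in trump['source']).

```python
import pandas as pd

# Stand-in for trump['source']: three of the values shown in the question.
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="https://mobile.twitter.com" rel="nofollow">Mobile Web (M5)</a>',
    '<a href="https://periscope.tv" rel="nofollow">Periscope</a>',
], name='source')

# regex=True makes pandas interpret the first argument as a regular expression.
# The pattern matches the whole string, so the whole string is replaced by group 1.
cleaned = source.str.replace(r'^<a.*nofollow">(.*)</a>$', r'\1', regex=True)

print(cleaned.tolist())
# ['Twitter for iPhone', 'Mobile Web (M5)', 'Periscope']
```

Note that str.replace returns a new Series, so the result has to be assigned back (for example to trump['source']) if the cleaned values should be kept.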

Notes:

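Now, this is quite a 'lazy' regex pattern (lazy in the sense of quick and non-robust, not in the sense of a non-greedy quantifier): it assumes every value looks exactly like the ones above. It suits the purpose of the task at hand, though. Here are the specifics of the pattern: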

^ Anchor the match at the beginning of the string
<a Match the opening of the anchor tag
.* Match 0+ occurrences of any character (greedy)
nofollow"> Match the end of the rel="nofollow" attribute and of the opening tag, which marks where to start capturing
(.*) Capture 0+ occurrences of any character (i.e. our subject text)
<\/a> Match the closing anchor tag, which ends the capture (the backslash before / is optional in Python)
$ Anchor the match at the end of the string
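
The same breakdown can be written directly into the pattern. This is only a sketch using the standard re module with re.VERBOSE on a single string taken from the question, not the pandas call itself. It also shows the non-robust side: a string that does not fit the expected shape is returned unchanged.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each piece carries its comment.
pattern = re.compile(r'''
    ^             # anchor at the start of the string
    <a            # opening of the anchor tag
    .*            # any characters (the href attribute and so on)
    nofollow">    # end of the opening tag
    (.*)          # group 1: the visible link text
    </a>          # closing anchor tag
    $             # anchor at the end of the string
''', re.VERBOSE)

sample = '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>'
result = pattern.sub(r'\1', sample)
print(result)      # Twitter for iPad

# A value without the expected tag does not match, so nothing is replaced.
unchanged = pattern.sub(r'\1', 'plain text without a tag')
print(unchanged)   # plain text without a tag
```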

The second parameter of str.replace() is the text used as the replacement. Because the pattern matches the entire string, the entire string is replaced by the first capture group, written as r'\1'. Without the raw-string prefix, write '\\1' instead.
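
To see why the raw prefix (or the doubled backslash) matters, here is a small sketch with the standard re module on one value from the question. The variable names are just for illustration. The same replacement strings should behave the same way when passed to str.replace with regex=True, since the backslash handling happens in Python's string literals and in re.

```python
import re

pattern = r'^<a.*nofollow">(.*)</a>$'
sample = '<a href="https://ads.twitter.com" rel="nofollow">Twitter Ads</a>'

raw_form = re.sub(pattern, r'\1', sample)        # raw string: backslash reaches re intact
escaped_form = re.sub(pattern, '\\1', sample)    # doubled backslash: same two characters
group_form = re.sub(pattern, r'\g<1>', sample)   # explicit group reference syntax

print(raw_form, escaped_form, group_form, sep=' | ')
# Twitter Ads | Twitter Ads | Twitter Ads

# A plain '\1' is an octal escape, i.e. the single control character chr(1),
# so the text would be replaced by that character instead of the captured group.
wrong_form = re.sub(pattern, '\1', sample)
print(repr(wrong_form))   # '\x01'
```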

Here is a link to a site I like to use to test regex patterns.
