
I'm cleaning some text from Reddit. When you include a link in a Reddit self-text, you do so like this: [the text you read](https://website.com/to/go/to). I'd like to use regex to remove the hyperlink (e.g. https://website.com/to/go/to) but keep the text you read.

Here is another example:

[the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts)

I'd like to keep: the podcast list.

How can I do this with Python's re library? What is the appropriate regex?

Ishaan Javali
Demetri Pananos
  • Please see this answer to a similar question about parsing html with regex: https://stackoverflow.com/a/1732454/3434388 – Danielle M. Dec 30 '18 at 18:08
  • I know it has a decent learning curve, but check out https://www.crummy.com/software/BeautifulSoup/ for parsing html – Danielle M. Dec 30 '18 at 18:09

1 Answer


I have created an initial attempt at your requested regex:

(?<=\[.+\])\(.+\)

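The first part, (?<=...), is a look-behind: it requires the bracketed text to come immediately before the match but does not include it in the match. Note that Python's built-in re module only supports fixed-width look-behinds, so this pattern, which has .+ inside the look-behind, raises an error there; it needs an engine that allows variable-width look-behinds, such as the third-party regex module, used with its sub method. You can also see the meanings of all the regex symbols here.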

You can extend the above regex to look for only things that have weblinks in the brackets, like so:

(?<=\[.+\])\(https?:\/\/.+\)

The drawback is that this will not match links that do not start with http:// or https://.

After this you will still need to remove the square brackets; simply removing all square brackets may work fine for you.
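The look-behind patterns above can be checked against Python's built-in re module, which, as far as I know, only accepts fixed-width look-behinds. A minimal sketch of that check follows; the helper name compiles_in_re is made up for illustration and is not part of any library:

```python
import re


def compiles_in_re(pattern: str) -> bool:
    """Return True if Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Look-behind containing .+ (variable width), as in the first attempt
lookbehind_ok = compiles_in_re(r"(?<=\[.+\])\(.+\)")

# A capturing-group pattern with no look-behind, for comparison
capture_ok = compiles_in_re(r"\[(.+)\]\(.+\)")

print(lookbehind_ok, capture_ok)
```

If the first value comes back False on your Python version, the look-behind form needs a different engine, while the capturing-group form can be used with re directly.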


Edit 1:

Valentino pointed out that re.sub accepts references to capturing groups in the replacement, which lets you capture the link text and substitute it back in using the following regex:

\[(.+)\]\(.+\)

You can then substitute the first captured group (in the square brackets) back in using:

re.sub(r"\[(.+)\]\(.+\)", r"\1", original_text)
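As a rough end-to-end sketch, here is that call applied to the podcast example from the question. The surrounding sentence in original_text is invented for illustration:

```python
import re

# Sample self-text; the link is the example from the question,
# the words around it are made up for this sketch.
original_text = (
    "See [the podcast list]"
    "(https://www.reddit.com/r/datascience/wiki/podcasts) for more."
)

# \[(.+)\] captures the link text as group 1;
# \(.+\) matches the parenthesised URL, which is discarded.
cleaned = re.sub(r"\[(.+)\]\(.+\)", r"\1", original_text)

print(cleaned)  # See the podcast list for more.
```

The replacement r"\1" puts back only the first captured group, so both the brackets and the URL disappear in one pass.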

If you want to look at the regex in more detail (if you're new to regex or want to learn what the symbols mean), I would recommend an online regex interpreter. These explain what each symbol does and make the pattern much easier to read, especially when there are lots of escaped symbols like there are here.

Adam Dadvar
  • Also, using re.sub() the lookbehind can be avoided: `re.sub(r"\[(.+)\]\(.+\)", r"\1", original_text)` substitutes the contents of the square brackets for the whole match. Just another way to do it, similar to yours. – Valentino Dec 30 '18 at 20:37
  • Thanks @Valentino, added! – Adam Dadvar Dec 30 '18 at 21:42
  • This solution only works for up to 1 link in `original_text`. By matching smallest group within the brackets instead it works nicely for more than 1 link as well: `\[(.+?)\]\(.+?\)` – sigurdo Sep 06 '22 at 15:13
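To illustrate the point in the last comment, here is a small comparison of the greedy and non-greedy patterns on a text with two links. The sample sentence is made up from the two examples in the question:

```python
import re

# Two links in one string, built from the question's examples
text = (
    "Read [the text you read](https://website.com/to/go/to) and "
    "[the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts)."
)

# Greedy: .+ runs to the last ] and the last ), so both links are
# swallowed by a single match and markup is left behind.
greedy = re.sub(r"\[(.+)\]\(.+\)", r"\1", text)

# Non-greedy: .+? stops at the first ] and the first ), so each
# link is matched and replaced separately.
lazy = re.sub(r"\[(.+?)\]\(.+?\)", r"\1", text)

print(greedy)
print(lazy)  # Read the text you read and the podcast list.
```

One assumption to keep in mind: the non-greedy URL part stops at the first closing parenthesis, so a URL that itself contains ")" would be cut short.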