How do I delete the brackets surrounding a certain word in Python?

Question

Hi, I am trying to write code that deletes the angle brackets whenever the word 'div' appears between them.

<div class="ipc-page-content-container ipc-page-content-container--center" role="presentation"><a class="ipc-button ipc-button--double-padding ipc-button--default-height ipc-button--core-baseAlt ipc-button--theme-baseAlt ipc-button imdb-footer__open-in-app-button" href="/whitelist-offsite?url=https%3A%2F%2Ftqp-4.tlnk.io%2Fserve%3Faction%3Dclick%26campaign_id_android%3D427112%26campaign_id_ios%3D427111%26destination_id_android%3D464200%26destination_id_ios%3D464199%26my_campaign%3Dmdot%2520sitewide%2520footer%2520%26my_site%3Dm.imdb.com%26publisher_id%3D350552%26site_id_android%3D133429%26site_id_ios%3D133428&amp;page-action=ft-gettheapp&amp;ref=ft_apps" tabindex="0"><div class="ipc-button__text">Get the IMDb App</div></a></div></div><div class="ipc-page-content-container ipc-page-content-container--center _2AR8CsLqQAMCT1_Q7eidSY" role="presentation">

For example, the <div class="ipc-page-content-container ipc-page-content-container--center" role="presentation"> would just become div class="ipc-page-content-container ipc-page-content-container--center" role="presentation" when using this code.

I tried to use a regular expression to find div in the text, but I can't seem to find a way to delete the angle brackets.

import re
with open("movie.text.txt", 'rt', encoding='UTF8') as myfile:
    text = myfile.read()

    regex = "<div .+>"

    text = re.sub(regex, "div .+", text)

This code seems to delete every line of the text and just replace it with div .+. Does anyone know how to make this code work properly?

I'd recommend using BeautifulSoup to parse html files instead of regex. See https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not — Jeff, Aug 27 '20 at 08:08
I never thought of using BeautifulSoup, thanks for suggesting it! — Dylan, Aug 27 '20 at 08:10

score 0 · Answer 1 · answered Aug 27 '20 at 08:11

A simple name_of_string.replace('<', '').replace('>', '') should do the trick. The '' is an empty string, so both brackets get replaced with nothing. Note that replace() takes one substring per call: the expression '>' or '<' evaluates to just '>', so the two replacements have to be chained, or done one angle bracket at a time: string_name.replace('>', ''), then the same for the other angle bracket.

ad stefnum


@Jeff is right though as this gives you more flexibility especially with cumbersome pages. – ad stefnum Aug 27 '20 at 08:13
But this will replace all tags not only divs – andy meissner Aug 27 '20 at 08:13
Also try <(div[^>]*)> with group(1) to retrieve the match – TonyR Aug 27 '20 at 08:18
Use a check such as re.search('div', s) as the condition, and perform the operation only if it matches. – ad stefnum Aug 27 '20 at 08:23
t = re.search(p, s) followed by if t is not None: print("$1:", t.group(1)) can also be used to check the match result – TonyR Aug 27 '20 at 09:12
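The capture-group idea from the comments can be sketched as follows. This is a minimal illustration, not a definitive solution: the sample string is hypothetical, and the \b after div is an added assumption (not part of the pattern in the comment) so that tag names that merely start with "div" are left alone.

```python
import re

# Hypothetical sample input; any HTML-like string would do.
html = '<div class="a"><a href="/x"><div class="b">Get the IMDb App</div></a></div>'

# [^>]* cannot cross a closing bracket, so each match stays inside one tag.
# The parentheses capture everything between the angle brackets.
pattern = re.compile(r"<(div\b[^>]*)>")

# \1 re-inserts the captured text, i.e. the tag without its brackets.
stripped = pattern.sub(r"\1", html)
print(stripped)

# The same pattern with re.search; a failed search returns None.
m = pattern.search(html)
if m is not None:
    print(m.group(1))
```

Closing tags such as </div> are left untouched here, because they start with </ and so never match <div.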

score 0 · Answer 2 · answered Aug 27 '20 at 08:23

One way to do it would be like this:

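import re

# [^>]* stops at the first '>', so each match covers a single div tag
# (a greedy .+ would run on to the last '>' on the line).
regex = r"<div [^>]*>"

def matched(matchobj):
    # Drop the first and last character of the match: the angle brackets.
    return matchobj.group(0)[1:-1]

text = re.sub(regex, matched, text)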

This passes a callback function to re.sub(), which is called for every match and returns the replacement string. In the replacement we slice off the first and last characters of the match, which are the angle brackets.

matchobj.group(0) returns the whole matched string.
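To see how much the choice of pattern matters when several tags share one line, as in the question's input, here is a small sketch that runs the same callback with a greedy pattern and with a negated character class. The sample line and the name strip_brackets are made up for illustration.

```python
import re

# Hypothetical one-line sample with several tags, like the question's input.
line = '<div class="a"><a href="/x">text</a></div>'

def strip_brackets(matchobj):
    # group(0) is the whole match; [1:-1] drops the first and last character.
    return matchobj.group(0)[1:-1]

# Greedy: .+ runs to the last '>' on the line, so there is one big match
# and only the outermost pair of brackets is removed.
greedy = re.sub(r"<div .+>", strip_brackets, line)

# Negated class: [^>]* stops at the first '>', so only the div tag matches.
per_tag = re.sub(r"<div [^>]*>", strip_brackets, line)

print(greedy)
print(per_tag)
```

With the greedy pattern the inner tags stay as they are and the final closing bracket of the line is lost; with the negated class only the opening div tag loses its brackets.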
