
Hi, I am trying to write code that removes the angle brackets from any tag where the word 'div' appears between them.

<div class="ipc-page-content-container ipc-page-content-container--center" role="presentation"><a class="ipc-button ipc-button--double-padding ipc-button--default-height ipc-button--core-baseAlt ipc-button--theme-baseAlt ipc-button imdb-footer__open-in-app-button" href="/whitelist-offsite?url=https%3A%2F%2Ftqp-4.tlnk.io%2Fserve%3Faction%3Dclick%26campaign_id_android%3D427112%26campaign_id_ios%3D427111%26destination_id_android%3D464200%26destination_id_ios%3D464199%26my_campaign%3Dmdot%2520sitewide%2520footer%2520%26my_site%3Dm.imdb.com%26publisher_id%3D350552%26site_id_android%3D133429%26site_id_ios%3D133428&amp;page-action=ft-gettheapp&amp;ref=ft_apps" tabindex="0"><div class="ipc-button__text">Get the IMDb App</div></a></div></div><div class="ipc-page-content-container ipc-page-content-container--center _2AR8CsLqQAMCT1_Q7eidSY" role="presentation">

For example, the <div class="ipc-page-content-container ipc-page-content-container--center" role="presentation"> would just become div class="ipc-page-content-container ipc-page-content-container--center" role="presentation" when using this code.

I tried to use a regular expression to find div in the text, but I can't find a way to delete just the angle brackets.

import re
with open("movie.text.txt", 'rt', encoding='UTF8') as myfile:
    text = myfile.read()

    regex = "<div .+>"

    text = re.sub(regex, "div .+", text)

This code seems to delete every line of the text and replace it with the literal string div .+. Does anyone know how to make this code work properly?

Dylan
  • I'd recommend using BeautifulSoup to parse html files instead of regex. See https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Jeff Aug 27 '20 at 08:08
  • I never thought of using BeautifulSoup, thanks for suggesting it! – Dylan Aug 27 '20 at 08:10

2 Answers


A simple name_of_string.replace('<', '').replace('>', '') should do the trick. The '' is the empty string, so each bracket gets replaced with nothing. Do one angle bracket per call: the expression '>' or '<' evaluates to just '>', so name_of_string.replace('>' or '<', '') would only remove the closing brackets. Note that this strips every angle bracket in the text, not only the ones around div tags.
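As a quick check of how the two forms of replace behave, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not taken from the question's file.

```python
# Hypothetical sample text, not the contents of movie.text.txt.
sample = '<div class="a"><a href="#">x</a></div>'

# A non-empty string is truthy, so ('>' or '<') evaluates to '>'.
# Only the closing brackets are removed here.
only_closing = sample.replace('>' or '<', '')

# One replace call per bracket removes both kinds.
both = sample.replace('<', '').replace('>', '')

print(only_closing)
print(both)
```

Both calls act on every tag in the string, so this only fits if losing the brackets on non-div tags is acceptable.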

ad stefnum

One way to do it would be like this:

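import re

# [^>]* stops at the first '>', so each match is a single opening div tag.
regex = r"<div\b[^>]*>"

def matched(matchobj):
    # group(0) is the whole matched tag; drop its first and last characters.
    return matchobj.group(0)[1:-1]

text = re.sub(regex, matched, text)

This gives re.sub() a callback function, which is called for every match and returns the replacement string. In the callback we slice off the first and last characters of the match, which are the angle brackets.

matchobj.group(0) returns the whole matched string. The pattern uses [^>]* rather than .+ because .+ is greedy and runs to the last > on the line, so everything from the first <div to that final > would be treated as one match.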
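To try the callback approach on a small input, here is a minimal sketch. The sample string and the names strip_brackets, GREEDY and TIGHT are made up for illustration. TIGHT is a stricter pattern, assumed here so that each match ends at the first closing bracket. The last line shows a backreference as an alternative to the callback.

```python
import re

# Hypothetical one-line sample, shortened from the HTML in the question.
sample = '<div class="ipc-button__text">Get the IMDb App</div></a>'

GREEDY = r"<div .+>"      # .+ runs to the last '>' on the line
TIGHT = r"<div\b[^>]*>"   # [^>]* stops at the first '>'

def strip_brackets(matchobj):
    # Drop the first and last characters of the matched text.
    return matchobj.group(0)[1:-1]

greedy_result = re.sub(GREEDY, strip_brackets, sample)
tight_result = re.sub(TIGHT, strip_brackets, sample)

# Same effect without a callback: capture the tag body and reinsert it.
backref_result = re.sub(r"<(div\b[^>]*)>", r"\1", sample)

print(greedy_result)
print(tight_result)
print(backref_result)
```

With the greedy pattern the whole line counts as one match, so only the outermost pair of brackets is removed. Closing tags such as </div> are left alone by all three patterns because they start with </.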

andy meissner