0

I have a BeautifulSoup Paragraph as string. I want to replace the occurrences of p (opening) and /p (closing) tags in the string using regular expressions because there are instances like

    <p class="section-para">We would be happy to hear from you, Please 
    fill in the form below or mail us your requirements on<br/><span 
    class="text-red">contact@xyz.com</span></p> 

But I can not use the generic

    ^< *>$

because I want strong,b and h1,h1..h6 tags for different purposes.

I only know basics of RegEx but do not know how to make and use one. Can somebody help me with the making "inclusion","exclusion" (if there is any). How can I make one for this problem and how can I substitute with the simple ''

def formatting(string):
    this=['<h1>','</h1>','<h2>','</h2>','<h3>','</h3>','<h4>','</h4>','<h5>','</h5>','<h6>','</h6>','<b>','</b>','<strong>','</strong>']
    with_this=['\nh1 Tag:','\n','\nh2 Tag:','\n''\nh3 Tag:','\n''\nh4 Tag:','\n''\nh5 Tag:','\n''\nh6 Tag:','\n','\Bold:','\n''\nBold:','\n']

    for i in range(len(this)):
        if this[i] in string:
            string=string.replace(this[i],with_this[i])
    return(string)

I have used replace functions of strings for h1,2...6 tags. Any help would be appreciated.

Patrick Artner
  • 50,409
  • 9
  • 43
  • 69
Mahesh
  • 266
  • 1
  • 11
  • 6
    [TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/a/1732454/7505395) – Patrick Artner Jun 20 '19 at 13:00
  • 1
    If you have already gone to the effort of using beautiful soup to parse your html then dont use regex. Regex is never recommended for parsing / replacing html – Chris Doyle Jun 20 '19 at 13:01
  • 2
    Hi! Generally, parsing HTML with RegEx is considered a fools errand. See https://stackoverflow.com/a/1732454/3543867 . There are a lot of edge cases that you can run into. One comon reference for handling html is the BeautifulSoup module/library - https://stackoverflow.com/questions/11709079/parsing-html-using-python – Cinderhaze Jun 20 '19 at 13:02
  • Although it is said that Chuck Norris is able to use regex on html ... – Patrick Artner Jun 20 '19 at 13:03
  • While it's true that you can't write a regex that matches only valid html, and trying to do so anyway will summon the dread Zalgo into our reality, merely trying to match the opening and closing of a

    tag might actually be possible because you can't nest a

    inside another

    (or at least, you're not supposed to).

    – Kevin Jun 20 '19 at 13:06
  • @Kevin So what is the solution. Any link would do. – Mahesh Jun 20 '19 at 13:10
  • Actually, I think I misread the question. If you _only_ want to replace

    tags and nothing else, that's possible. But it looks like you also want to replace and . Those tags _can_ be nested, so you can't write a regex that matches them.

    – Kevin Jun 20 '19 at 13:14
  • What do you want to replace the

    with?

    – Chris Doyle Jun 20 '19 at 13:26
  • @ChrisDoyle just the simple ''. x='123'; x.replace('1',''); x=23 – Mahesh Jun 20 '19 at 13:41
  • @Kevin not even when I store the whole parsed value as a string in a variable and then process it character by character. I know it is the dumbest idea but I don't know any other method. – Mahesh Jun 20 '19 at 13:43

2 Answers2

3

Its not clear what exactly you want to replace but maybe the below can help, it will allow you to replace tags with text if thats what you need. Am sure you will be able to adjust fruther to make it do what you want. Also you didnt specify the version of BS you are using. I am using BS4. The function will take a Beautiful soup object, a tag to find, a prefix I.E what you want to replace the start tag with and a suffix, I.E what you want to replace the end tag with.

from bs4 import BeautifulSoup

def format_soup_tag(soup, tag, prefix, suffix):
    target_tag = soup.find(tag)
    target_tag.insert_before(prefix)
    target_tag.insert_after(suffix)
    target_tag.unwrap()

html = '<p class ="section-para">We would be happy to hear from you, <strong>Please fill in the form below</strong> or mail us your requirements on <br/><span class ="text-red" >contact@xyz.com</span></p>'
soup = BeautifulSoup(html, features="lxml")
print("###before modification###\n", soup, "\n")

format_soup_tag(soup, 'p', '\np tag: ', '\n')
print("###after p tag###\n", soup, "\n")

format_soup_tag(soup, 'strong', '\Bold: ', ' \Bold')
print("###after strong tag###\n", soup, "\n")

OUTPUT

###before modification###
 <html><body><p class="section-para">We would be happy to hear from you, <strong>Please fill in the form below</strong> or mail us your requirements on <br/><span class="text-red">contact@xyz.com</span></p></body></html> 

###after p tag###
 <html><body>
p tag: We would be happy to hear from you, <strong>Please fill in the form below</strong> or mail us your requirements on <br/><span class="text-red">contact@xyz.com</span>
</body></html> 

###after strong tag###
 <html><body>
p tag: We would be happy to hear from you, \Bold: Please fill in the form below \Bold or mail us your requirements on <br/><span class="text-red">contact@xyz.com</span>
</body></html> 
Chris Doyle
  • 10,703
  • 2
  • 23
  • 42
  • https://stackoverflow.com/questions/56688597/how-to-make-a-dataset-from-the-given-website-which-shows-p-tags-values-and-h1-h2 . This i s the original problem that I am working on. I am using BS4 – Mahesh Jun 20 '19 at 14:52
  • Thanks for the answer that was helpful. also I want to add the h1,h2,h3 ,values also like you have done for strong, if present in the paragraph. – Mahesh Jun 20 '19 at 14:58
  • 2
    The function is written in such a way as to be flexible so you can add more lines if you like such as `format_soup_tag(soup, 'h1', '\nh1 tag: ', '\n')` you could even make a dictionary of the tag names with an array of the prefix and suffix and iterate over the dict – Chris Doyle Jun 20 '19 at 15:04
-1

I hope I understood you correctly and please correct me if I'm wrong. You have something like:

<p class="section-para">We would be happy to hear from you, Please 
fill in the form below or mail us your requirements on<br/><span 
class="text-red">contact@xyz.com</span></p>

And want something like:

<p>We would be happy to hear from you, Please 
fill in the form below or mail us your requirements on<br/><span 
class="text-red">contact@xyz.com</span></p>

You can simply do:

saved_content = re.search(
    '<p (.*?)>(?P<content>.*)</p>',
    your_string
).groupdict()

result = re.sub(
    r'<p (.*?)>(.*)</p>',
    f'<p>{saved_content.get("content")}</p>',
    your_string
)

Please note, that I've used f-strings, which are only available in Python 3.6 or higher. I hope it helped you and let me know if I misunderstood anything or if any questions are left. Have a nice day!

Train
  • 585
  • 1
  • 6
  • 20
  • Thank you for the answer it is somewhat like this. But the original work I want to do is like this.- https://stackoverflow.com/questions/56688597/how-to-make-a-dataset-from-the-given-website-which-shows-p-tags-values-and-h1-h2 – Mahesh Jun 20 '19 at 14:53
  • If this answer is the correct one for this particular question, you should mark it as correct answer. I'll have a look at the other questions, too. – Train Jun 20 '19 at 14:55
  • yuor other question doesnt make sense. why would there be any heading (h1,h2...h6) tag inside a p tag. the heading tag would essentially clsoe the p tag in terms of the html tree structure. If you have a look at the FAQ the best way to get help it to create a minimum complete example of what your trying to do. That means provide a sample of your input. show what you expect the output to be and show any code you have tried. – Chris Doyle Jun 20 '19 at 15:01