I need to find all PHP tags, but I run into trouble when the PHP calls a method on an object with "->": the parser treats the ">" as the end of the tag.
PHP tag: <html><body> Blah Blah Blah... <h2>Section Heading <?php $playFrom->time("09:58"); ?></h2>Blah blah blah </body></html>
My Code:
taglist = soup.findAll("?php")
for tag in taglist:
    tag.replaceWith("")
Instead of removing the whole PHP tag, this leaves: <h2>Section Heading time("09:58"); ?>
Can BeautifulSoup do this? If so, what is the proper way?
EDIT(1): As Ryan points out:
"PHP isn’t HTML, so you can’t really parse it with an HTML parser."
I have discovered that the soup parser automatically removes the PHP and leaves behind scraps, all of which end up within the text of the <h2> tags. So my solution is to clean up that text with findAll('h2') ... text.replace('badstuff', 'good stuff') ...
My new question is: since lxml is the default parser (as per this link: Set lxml as default BeautifulSoup parser), shouldn't I still be able to find a way to delete the PHP cleanly using BS4?
NOTE (my solution): By eliminating the findAll("?php") ... code above and just letting BS4 parse the HTML, I get the following result for the <h2> tags.
<h2>Section Heading <?php $playFrom->time("09:58"); ?></h2>
becomes this:
<h2>Section Heading time("09:58"); ?></h2>
The above result is from:
soup = BeautifulSoup(html.read(),'lxml')
print(soup.body.h2)
html.close()
The following code version cleans that up:
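from bs4 import BeautifulSoup

# html is assumed to be an already-opened file object
soup = BeautifulSoup(html.read(), 'lxml')
h2list = soup.findAll("h2")
for tag in h2list:
    text = tag.get_text()  # start from the heading's current text
    text = text.replace('time("', '(')
    text = text.replace('"); ?>', ')')
    tag.string = text  # replace the heading's contents with the cleaned text
print(soup.body.h2)
html.close()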
Producing this:
<h2>Section Heading (09:58)</h2>
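Since PHP isn't HTML, another option might be to strip the PHP blocks from the raw text before the HTML parser ever sees them. The following is only a minimal sketch of that idea using the standard `re` module; `strip_php` and `PHP_BLOCK` are hypothetical names, and the pattern assumes that no PHP string or comment contains a literal `?>` and that only the long `<?php` opening form is used.

```python
import re

# Non-greedy match from "<?php" to the next "?>"; DOTALL lets it span lines.
# Assumption: no PHP string literal or comment itself contains "?>".
PHP_BLOCK = re.compile(r"<\?php.*?\?>", re.DOTALL)


def strip_php(markup):
    """Return markup with every <?php ... ?> block removed (hypothetical helper)."""
    return PHP_BLOCK.sub("", markup)


sample = (
    '<html><body> Blah Blah Blah... '
    '<h2>Section Heading <?php $playFrom->time("09:58"); ?></h2>'
    'Blah blah blah </body></html>'
)
cleaned = strip_php(sample)
print(cleaned)
```

The cleaned string could then be handed to BeautifulSoup(cleaned, 'lxml') as usual, so the "->" never reaches the HTML parser. Note that this drops the PHP output entirely rather than keeping the "09:58" text.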