
I need to find all PHP tags, but I run into trouble when the PHP code calls a method on an object with "->". The parser picks up the ">" as the end of the tag.

PHP tag: <html><body> Blah Blah Blah... <h2>Section Heading <?php $playFrom->time("09:58"); ?></h2>Blah blah blah </body></html>

My Code:

taglist = soup.findAll("?php")
for tag in taglist:
    tag.replaceWith("")

After the replacement, the heading is left as: <h2>Section Heading time("09:58"); ?&gt;

Can BeautifulSoup do this? If so, what is the proper way?

EDIT(1): As Ryan points out:

"PHP isn’t HTML, so you can’t really parse it with an HTML parser."

I have discovered that the soup parser automatically removes the start of the PHP tag and leaves behind scraps, all of which end up in the text of the <h2> tags. So my solution is to clean up that text with findAll('h2')... text.replace('badstuff', 'good stuff')... My new question is: since lxml is the default parser (as per the linked question "Set lxml as default BeautifulSoup parser"), shouldn't I still be able to find a way to delete the PHP cleanly using BS4?
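One way to sidestep the parser entirely is to remove the PHP blocks from the raw text before it ever reaches BeautifulSoup. The following is only a minimal sketch using the standard `re` module; the names `PHP_BLOCK` and `strip_php` are made up for illustration, and it assumes no PHP string literal in the files contains a literal "?>".

```python
import re

# Hypothetical helper, not part of BeautifulSoup.
# Matches one PHP block, from "<?php" up to the nearest "?>".
# re.DOTALL lets a block span several lines; ".*?" is lazy so two
# blocks on the same line are removed separately.
PHP_BLOCK = re.compile(r"<\?php.*?\?>", re.DOTALL)


def strip_php(source: str) -> str:
    """Return the source with every <?php ... ?> block removed."""
    return PHP_BLOCK.sub("", source)


sample = '<h2>Section Heading <?php $playFrom->time("09:58"); ?></h2>'
print(strip_php(sample))
```

The cleaned string can then be passed to BeautifulSoup as usual, so the "->" never reaches the HTML parser.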

NOTE (my solution): By eliminating the findAll("?php")... code above I get the following result for the <h2> tags by just letting BS4 soup parse the HTML.

<h2>Section Heading <?php $playFrom->time("09:58"); ?></h2>

becomes this:

<h2>Section Heading time("09:58"); ?&gt;</h2>

The above result comes from:

soup = BeautifulSoup(html.read(),'lxml')
print(soup.body.h2)
html.close()

The following code version cleans that up:

soup = BeautifulSoup(html.read(), 'lxml')

h2list = soup.findAll("h2")
for tag in h2list:
    text = tag.get_text()  # text left inside the <h2> after parsing
    text = text.replace('time("', '(')
    text = text.replace('"); ?>', ')')
    tag.string = text

print(soup.body.h2)
html.close()

Producing this:

<h2>Section Heading (09:58)</h2>
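The same result can also be reached by rewriting the PHP call in the raw text before parsing, which avoids depending on the scraps the parser happens to leave behind. This is a rough sketch under the assumption that every call looks exactly like `$playFrom->time("...")` as in the example above; `PHP_TIME_CALL` and `php_time_to_text` are hypothetical names.

```python
import re

# Hypothetical pattern for the one call shown in the question:
#   <?php $playFrom->time("09:58"); ?>
# The capture group grabs whatever sits between the double quotes.
PHP_TIME_CALL = re.compile(r'<\?php\s+\$playFrom->time\("([^"]*)"\);\s*\?>')


def php_time_to_text(source: str) -> str:
    """Replace each matching PHP call with its time in parentheses."""
    return PHP_TIME_CALL.sub(r"(\1)", source)


sample = '<h2>Section Heading <?php $playFrom->time("09:58"); ?></h2>'
print(php_time_to_text(sample))
```

PHP blocks that do not match the pattern are left untouched, so they would still need separate handling.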

  • PHP isn’t HTML, so you can’t really parse it with an HTML parser. What do you need this for? Is it to strip out dangerous things from user input, or something else? – Ry- Apr 02 '17 at 22:17
  • I have a bunch of HTML files with this code in it, and I want to strip it out. There are multiple lines of this PHP code throughout each file, and there are hundreds of files. The good thing is the files are all consistently coded. – ajnabi Apr 02 '17 at 22:26
  • Beautiful Soup is a Python library for pulling data out of HTML and XML files. It doesn't work for PHP tags. – frfahim Apr 02 '17 at 22:36
  • OK, thanks. I have pre-processed the files with sed, removing the "->" and then BeautifulSoup picked up the PHP tags and deleted them no problem. I just wanted to know if there was a way to make BS do it. – ajnabi Apr 02 '17 at 22:44
  • I have discovered that this method removes some good HTML as well as the bad PHP code. I have revised my question. Please see edit(1) of original post. – ajnabi Apr 03 '17 at 11:40
  • Simply put, do you want to strip all PHP code from your files? – Bill Bell Apr 03 '17 at 18:03
  • The bottom line is: can I use BS4 with the lxml parser to delete or otherwise manipulate PHP tags? In this case, I ideally want to keep the time in parentheses. It would seem to me that HTML sometimes has PHP embedded in it, and that BS4 would have a way to deal with it. – ajnabi Apr 04 '17 at 15:13
