I came across an unusual HTML document on the web that contains multiple html tags nested inside the parent html tag. I want to parse the contents of the nested html tags. Can anyone point me in the right direction?

Thanks in advance.

Edit 1: Using BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

gives only the parent html tag and the tags present within it.

However, I am assuming that if a browser is able to render the HTML, BS should be able to parse it. Is that assumption correct?

Edit 2: The HTML is actually malformed (I am assuming). Below is the HTML I am parsing with BeautifulSoup; somehow I am only getting the tables of the first (outermost) html tag. If I manually remove the extra html tags and keep only one, I am able to parse the table in BS. So the question is: is there any way to parse the HTML below and get the data from the innermost table, or from all tables in the file?

<!DOCTYPE html>
<html>
<head>
    <title>Some Title</title>
</head>
<body>
    some html to display the tables.
    <html>
        <head></head>
        <title>Some other title</title>
        <body>
            some html to display even more tables.
        </body>
    </html>
</body>
</html>
Kaustubh

2 Answers

Here is some sample code you can use to find the text inside a particular kind of HTML tag:

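from bs4 import BeautifulSoup

# x is the HTML string to parse
soup2 = BeautifulSoup(x, 'html.parser')
for i in soup2.find_all('ul', attrs={'class': 'results-base'}):
    for j in i.find_all('li'):
        print(j.get_text())  # text of each list item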
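
Applied to the nested document from the question, the same find_all approach might look like the sketch below. It assumes the html.parser builder, which builds the tree from the tags as written and so keeps the inner html tag in place. The sample string and its two tables are made up for illustration (the question elides the real table markup), and results with other parsers such as lxml may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the question's document: an <html> element
# nested inside the outer <body>, with one made-up table at each level.
html = """<!DOCTYPE html>
<html>
<head><title>Some Title</title></head>
<body>
    <table><tr><td>outer</td></tr></table>
    <html>
        <head></head>
        <title>Some other title</title>
        <body>
            <table><tr><td>inner</td></tr></table>
        </body>
    </html>
</body>
</html>"""

# html.parser does not restructure the markup, so the nested <html> tag
# stays in the tree as a descendant of the outer <body>.
soup = BeautifulSoup(html, "html.parser")

# Every table in the file, at any nesting depth, in document order.
tables = soup.find_all("table")
cells = [[td.get_text(strip=True) for td in t.find_all("td")] for t in tables]

# Only the tables that sit inside the nested <html> element.
nested_html = soup.find("body").find("html")
inner_tables = nested_html.find_all("table") if nested_html else []

print(cells)
print(len(inner_tables))
```

If only the innermost data is needed, searching from nested_html rather than from soup keeps the outer tables out of the result.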
nishant kumar
  • I have updated the question to contain more details could you please comment on that? Thanks in advance. – Kaustubh Jun 26 '17 at 08:45

Here are some resources relevant to your question; I think you can find a good answer there.

  1. http://www.compjour.org/warmups/govt-text-releases/intro-to-bs4-lxml-parsing-wh-press-briefings/
  2. Using BeautifulSoup to find a HTML tag that contains certain text
  3. Find index of tag with certain text in beautifulsoup/python
Mika Wolf