I'm trying to get results from an HTML file using BeautifulSoup:

from bs4 import BeautifulSoup

with open(r'/home/maria/Desktop/iqyylog.html', "r") as f:
    page = f.read()
soup = BeautifulSoup(page, 'html.parser')
for tag in soup.find_all('details'):
    print(tag)

The problem is that iqyylog.html contains more than 2500 nodes, so parsing takes a long time. Is there any other way to parse an HTML file with this much data? When I use the lxml parser, it only picks up the first 25 nodes.
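(For reference, the lxml attempt is presumably the same code with only the parser argument changed:

soup = BeautifulSoup(page, 'lxml'))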

  • Please check this link https://stackoverflow.com/questions/31201434/using-beautifulsoup-on-very-large-html-file-memory-error – Karthik Sep 02 '20 at 07:35

1 Answer

Try this.

from simplified_scrapy import SimplifiedDoc, utils

html = utils.getFileContent(r'test.html')
doc = SimplifiedDoc(html)
details = doc.selects('details')
for detail in details:
    print(detail.tag)
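
Alternatively, if you want to stay with BeautifulSoup, you can restrict the parse to the details tags with a SoupStrainer so the full tree is never built. A minimal sketch, assuming the file still fits in memory:

from bs4 import BeautifulSoup, SoupStrainer

with open(r'/home/maria/Desktop/iqyylog.html', "r") as f:
    page = f.read()
# Only <details> elements are parsed into the tree; everything else is skipped
only_details = SoupStrainer('details')
soup = BeautifulSoup(page, 'html.parser', parse_only=only_details)
for tag in soup.find_all('details'):
    print(tag)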

If you still have problems, try the following.

import io
from simplified_scrapy import SimplifiedDoc, utils
def getDetails(fileName):
    details = []
    tag = 'details'
    with io.open(fileName, "r", encoding='utf-8') as file:
        # Suppose the start and end tags are not on the same line, as shown below
        # <details>
        #   some words
        # </details>
        line = file.readline()  # Read data line by line
        stanza = None # Store a details node
        while line != '':
            if line.strip() == '':
                line = file.readline()
                continue
            if stanza and line.find('</' + tag + '>') >= 0:
                doc = SimplifiedDoc(stanza + '</' + tag + '>')  # Instantiate a doc
                details.append(doc.select(tag))
                stanza = None
            elif stanza:
                stanza = stanza + line
            else:
                if line.find('<' + tag) >= 0:
                    stanza = line

            line = file.readline()
    return details


details = getDetails('test.html')
for detail in details:
    print(detail.tag)
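
This second version reads the file line by line, so only one details block is held in memory at a time and each block is parsed on its own, instead of building a tree for the whole 2500-node file at once.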