My Python process that processes multiple pages on one website crashes on the line:
soup = BeautifulSoup(cleaned_html, "lxml")
Moreover, it crashes on a different page each time.
I use Python 2.7, bs4 0.0.1, and lxml 3.6.0.
Could you please help me? Thanks in advance!
My code:
def clean_html(self, html, document_format):
    """Clean and rearrange HTML and return a BeautifulSoup object.

    Args:
        html: Raw HTML string of the page.
        document_format: Source format of the record; 'abbyy' and 'sec'
            receive format-specific tag adjustments and are parsed with
            "html5lib", everything else is parsed with "lxml".

    Returns:
        A cleaned BeautifulSoup object, or None if the cleaned HTML
        cannot be parsed at all.
    """
    cleaned_html = html
    # Remove all unimportant tags, except for the ones used by Abbyy
    cleaned_html = self.remove_unimportant_tags_except_for_p_b_font_a(cleaned_html)
    # Replace HTML entities such as "&nbsp;" with plain characters
    cleaned_html = self.replace_html_symbols(cleaned_html)
    # Remove extra spaces
    cleaned_html = self.remove_extra_space(cleaned_html)
    # Adjust html for the files from Abbyy or SEC
    if document_format == 'abbyy':
        logger.info("Record is made by Abbyy")
        cleaned_html = self.adjust_abbyy_tags(cleaned_html)
    elif document_format == 'sec':
        logger.info("Record is a SEC document")
        cleaned_html = self.adjust_sec_tags(cleaned_html)
    # Remove the unimportant tags used by Abbyy
    cleaned_html = self.remove_p_b_font_a(cleaned_html)
    # Remove extra spaces introduced by the tag removal
    cleaned_html = self.remove_extra_space(cleaned_html)
    logger.info("HTML is cleaned before making soup")
    # Parser choice does not raise, so keep it outside the try block;
    # only the actual parse can fail.
    parser = "html5lib" if document_format in ("abbyy", "sec") else "lxml"
    try:
        soup = BeautifulSoup(cleaned_html, parser)
    except Exception as e:
        logger.warning("Beautiful soup cannot be made out of this page: {}".format(str(e)))
        return None
    logger.info("Soup is made")
    # Remove unwanted tag containers together with their content.
    # (A plain loop instead of a side-effect list comprehension, and one
    # pass over the tag names instead of eleven near-identical lines.)
    for tag_name in ('script', 'style', 'del', 's', 'strike', 'base',
                     'basefont', 'noscript', 'applet', 'embed', 'object'):
        for tag in soup(tag_name):
            tag.extract()
    logger.info("Soup is cleaned")
    return soup
If I do not specify "lxml", I get the following notification:
C:\Users\EERMIL~1\AppData\Local\Temp\2\_MEI38~1\bs4\__init__.py:166: UserWarning
: No parser was explicitly specified, so I'm using the best available HTML parse
r for this system ("lxml"). This usually isn't a problem, but if you run this co
de on another system, or in a different virtual environment, it may use a differ
ent parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
If I use "html5lib" instead of "lxml", the Python process does not crash, but I cannot get all of the text out of the HTML page. Namely, I get the following error (which I catch, as you can see in the code below):
'NoneType' object has no attribute 'next_element'
when I execute the following code:
for child in soup.children:
# If it is an irregular tag, skip it
if str(type(child)) == "<class 'bs4.element.Tag'>":
# If name has strange symbols, skip it
if re.search('[^a-z0-9]', child.name):
continue
# If there is no text inside, skip it
try:
if not re.search('(\w|\d)', child.get_text()):
continue
except Exception as e:
logger.warning("Unexpected exception in getting text from tag {}: {}".format(str(child), str(e)))
continue