17

I have a simple code like:

    p = soup.find_all("p")
    paragraphs = []

    for x in p:
        paragraphs.append(str(x))

I am trying to convert a list I obtained from XML into strings. I want to keep each element with its original tags so I can reuse some of the text, which is why I am appending it like this. But the list contains over 6000 observations, so a recursion error occurs because of the `str` call:

"RuntimeError: maximum recursion depth exceeded while calling a Python object"

I read that you can raise the maximum recursion limit, but it's not wise to do so. My next idea was to split the conversion into batches of 500, but I am sure there has to be a better way to do this. Does anyone have any advice?
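
For reference, the workaround I read about looks like this; it's only a sketch, since raising the limit just hides whatever is recursing so deeply:

    import sys

    # the default limit is usually 1000; raising it only postpones the error
    sys.setrecursionlimit(10000)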

samuraiexe
  • Can you post an example xml file to pastebin or something (with sensitive data removed, if necessary)? I'm having trouble seeing why just calling `str` on a `<p>` element should cause a recursion depth error, unless you have tags nested to a depth of near 500. – senshin Jan 07 '14 at 10:07
  • I'm using public data. The file can be found here: http://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt. As I mentioned in the description, there are over 6000 paragraph tags in `p` – samuraiexe Jan 07 '14 at 10:10
  • What's causing the problem are the binary graphic blocks at the bottom of the document, some of which contain the sequence `<P`. – senshin Jan 07 '14 at 10:17
  • @senshin: no, beautifulsoup works fine. The problem lies in converting each individual tag into a string, which gives me the RuntimeError – samuraiexe Jan 07 '14 at 10:22
  • Okay, if you think that's the issue, try this: add a counter to your for loop, and at each iteration, increment the counter by one and print out its value (a minimal sketch follows these comments). Tell me what the value of the counter is when the `RuntimeError` occurs. – senshin Jan 07 '14 at 10:23
  • `i` goes to 6015; `len(p)` = 6040
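
A minimal version of that counter diagnostic, assuming the same `p` and loop as in the question (the `try`/`except` only reports the failing index before re-raising):

    paragraphs = []
    for i, x in enumerate(p):
        try:
            paragraphs.append(str(x))
        except RuntimeError:
            # report which element blew the recursion limit, then re-raise
            print('str() failed at index', i)
            raise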

2 Answers

11

The problem here is probably that some of the binary graphic data at the bottom of the document contains the sequence of characters `<P`, which Beautiful Soup is trying to repair into an actual HTML tag. I haven't managed to pinpoint which text is causing the "recursion depth exceeded" error, but it's somewhere in there. It's `p[6053]` for me, but since you seem to have modified the file a bit (or maybe you're using a different parser for Beautiful Soup), it'll be different for you, I imagine.

Assuming you don't need the binary data at the bottom of the document to extract whatever you need from the actual `<p>` tags, try this:

    # boot out the last `<document>`, which contains the binary data
    soup.find_all('document')[-1].extract()

    p = soup.find_all('p')
    paragraphs = []
    for x in p:
        paragraphs.append(str(x))
senshin
  • I'm also using this code for other documents too, so I don't think throwing away the last document is the best option, but you are leading me on the right path. – samuraiexe Jan 07 '14 at 10:41
  • You'll want to adopt a strategy that basically involves looking at the `<type>` of each `<document>` and eliminating the ones for which the `<type>` is `GRAPHIC` (see the sketch after these comments). This won't be elegant, because the data file has horribly malformed tags, but it should work. You could also try checking for `begin 644`, which appears in `GRAPHIC` documents only. As a last resort, try renaming `<html>...</html>` in the data file to `<htmltwo>...</htmltwo>`, and then iterating only over `soup.find_all('htmltwo')` rather than the entire soup. – senshin Jan 07 '14 at 11:02
  • @samuraiexe All of these could potentially fail if the text of the non-`GRAPHIC` parts of the data file contains some of the same strings you're filtering on, but I think that's something you're going to have to live with. – senshin Jan 07 '14 at 11:03
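
A minimal sketch of that `GRAPHIC`-filtering strategy, assuming the file's `<TYPE>` markers parse as `type` elements (the tags are malformed, so treat this as a starting point rather than a tested solution):

    # drop every <document> whose <type> starts with GRAPHIC,
    # then collect <p> tags from what remains
    for doc in soup.find_all('document'):
        type_tag = doc.find('type')
        if type_tag is not None and type_tag.get_text().strip().startswith('GRAPHIC'):
            doc.extract()

    paragraphs = [str(x) for x in soup.find_all('p')]
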
1

I believe the issue is that the BeautifulSoup object `p` is not built iteratively, so the recursion limit is reached before you can finish constructing `p = soup.find_all('p')`. Note that a `RecursionError` is similarly thrown when building `soup.prettify()`.

For my solution I used the `re` module to gather all `<p>...</p>` tags (see the code below). My final result was `len(p) = 5571`. This count is lower than yours because the regex did not match any text within the binary graphic data.

    import re
    from urllib.request import urlopen

    url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'

    # read the raw bytes and decode them rather than calling str() on bytes
    response = urlopen(url).read().decode('utf-8', errors='replace')

    # non-greedy match from <P to the next </P>; re.DOTALL lets '.' cross newlines
    # (no capture groups, so findall returns the full matched strings)
    p = re.findall(r'<P.+?</P>', response, flags=re.DOTALL)

    paragraphs = []
    for x in p:
        paragraphs.append(x)  # the matches are already strings
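
For illustration, here is how the pattern behaves on a small made-up snippet (the sample string is hypothetical, not taken from the SEC file):

    sample = '<P>First.</P>\n<P ALIGN="left">Second\nline.</P>'
    print(re.findall(r'<P.+?</P>', sample, flags=re.DOTALL))
    # ['<P>First.</P>', '<P ALIGN="left">Second\nline.</P>']
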
mattcan