EDIT: Figured it out. I just did the following:

import sys

sys.setrecursionlimit(1500)  # This raises the recursion limit, moving up the
                             # ceiling on the stack so it doesn't overflow.

Check out this post for more info: What is the maximum recursion depth in Python, and how to increase it?
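For reference, you can check the current ceiling before raising it; a minimal sketch (CPython's default limit is typically 1000):

import sys

print sys.getrecursionlimit()  # usually 1000 by default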

--------------ORIGINAL QUESTION-----------------

I'm currently scraping webpages for dates. As of now, I'm successfully pulling the dates in the format I'm searching for using re.findall, but once I get to about the 33rd link, I get a "maximum recursion depth exceeded while calling a Python object" error, and the traceback keeps pointing to the dates = re.findall(regex, str(webpage)) line.

From what I've read, I need to employ a loop within my code so that I can get rid of the recursion, but as a novice, I'm unsure how to change the piece of code dealing with the RegEx and re.findall from recursive to iterative. Thanks in advance for any insights.

import urllib2
from bs4 import BeautifulSoup as BS
import re

#All code is correct between imports and the start of the For loop

for url in URLs:
    ...

    #Open and read the URL and specify html.parser as the parsing agent so that the parsing method remains uniform across systems
    webpage = BS(urllib2.urlopen(req).read(), "html.parser")

    #Create a list to store the dates to be searched
    regex = []

    #Append to a list those dates that have the end year "2011"
    regex.append("((?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec)[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1]))")

    #Join all the dates matched on the webpage from the regex by a comma
    regex = ','.join(regex)

    #Find the matching date format from the opened webpage 
    #[Recursion depth error happens here]
    dates = re.findall(regex, str(webpage))

    #If there aren't any dates that match, then go to the next link
    if dates == []:
        print "There was no matching date found in row " + CurrentRow
        j += 1
        continue

    #Print the dates that match the RegEx and the row that they are on
    print "A date was found in the link at row " + CurrentRow
    print dates
    j += 1
CopyLeft
  • Because of the use of a lot of `or` statements in your pattern, there is a LOT of recursion in just running the pattern. On top of that, `re.findall` uses recursion as well to append all matches to a list. Putting it all together to scrape a whole URL for dates will then obviously run into problems for large URL strings and lots of dates. – R Nar Oct 26 '15 at 16:46
  • Okay. I understand the problem is recursion, but how can I change this process to an iterative one in order to remedy the problem? My goal is to scrape the HTML of each website and match it to a given month/day/year format. If you could offer general steps or, better yet, actual code to create an iterative process, that would be much appreciated. As I said, I don't understand how to change my current recursive process into an iterative one. – CopyLeft Oct 26 '15 at 19:06

2 Answers

I don't think

regex.append("...")

is doing what you think it should be doing.

After the append method is called, regex is a one-element list holding your regular expression. The join that follows suggests you expect it to be a multi-element list somehow.

Once you fix that, I suspect your code will work better.
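A minimal sketch of what I mean, assuming the rest of your loop stays unchanged: with a single pattern there is no need for a list or a join, so pass the string straight to re.findall.

# Sketch: a single pattern needs no list or join
regex = "((?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec)[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1]))"

dates = re.findall(regex, str(webpage))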

Erik
  • Is there any way you could offer an iterative solution to this code? Ultimately, my goal is to scrape each of the webpages I have in a list to find the dates on it that match a given format through iterative means. Even just a general solution with or without code would be helpful. – CopyLeft Oct 26 '15 at 19:04

Continuing from my comment: you could create a lot of different patterns and iterate through each of those, instead of using one pattern with a lot of OR branches. Something like this might work:

regex = "January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec"
regex = ["((?:"+month+")[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1]))" for month in regex.split("|")]

matches = []
for pattern in regex:
    matches.append(re.findall(pattern, str(webpage))

This is a more iterative way of doing this, BUT it is super slow, because it will run re.findall for every month pattern on EVERY SINGLE WEBPAGE. As you can see, if you have at least 33 links like you say in your question, that would be 25*33 runs of re.findall (the alternation splits into 25 month tokens). Additionally, I am not a Python expert by any means, and I am not even completely sure that this solution would get rid of your problem entirely.
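One refinement that might help with the speed, sketched under the assumption that regex is the list of patterns built above: compile each pattern once with re.compile before the URL loop, and use extend so matches ends up as one flat list of dates instead of a list of lists.

# Sketch: compile each pattern once, reuse it for every webpage
compiled = [re.compile(p) for p in regex]

matches = []
for pattern in compiled:
    matches.extend(pattern.findall(str(webpage)))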

R Nar