
For some reason, when I run this code it keeps processing the same object and never fetches any new items from the database. In other words, the print output shows the same object over and over, when it should be iterating over the items in the list. Here is my code:

article = Article.objects.filter(is_locked=False, is_downloaded=False).first()
while article:
    article.is_locked = True
    article.save()

    print '******************************'
    date = article.datetime
    title = article.title
    url = article.url
    print('date: %s' % date)
    print('url: %s' % url)
    print('title: %s' % title)

    get_article(url, title, article)

    article = Article.objects.filter(is_locked=False, is_downloaded=False).first()

Where mldb.models is:

from django.db import models


class Article(models.Model):
    url = models.CharField(max_length=1028)
    title = models.CharField(max_length=1028)
    category = models.CharField(max_length=128)
    locale = models.CharField(max_length=128)
    section = models.CharField(max_length=512)
    tag = models.CharField(max_length=128)
    author = models.CharField(max_length=256)
    datetime = models.DateTimeField()
    description = models.TextField()
    article = models.TextField()
    is_locked = models.BooleanField(default=False)
    is_downloaded = models.BooleanField(default=False)

    def __str__(self):              # __unicode__ on Python 2
        return self.title

    class Meta:
        app_label = 'mldb'

I have also tried this, but it does not loop through the objects either (the loop just repeats the same object over and over):

articles = Article.objects.filter(is_locked=False, is_downloaded=False)
for article in articles:
   ...

Here is get_article(). This seems to be what is causing the problem (if I remove the call to this function, everything works properly):

import re
import time
import urllib2

import dateutil.parser
from bs4 import BeautifulSoup


def get_article(url, title, article):
    failed_attempts = 0
    while True:
        try:
            content = urllib2.urlopen(url).read()

            soup = BeautifulSoup(content, "html5lib")

            description = soup.find(property="og:description")["content"] if soup.find(property="og:description") else ''
            locale = soup.find(property="og:locale")["content"] if soup.find(property="og:locale") else ''
            section = soup.find(property="og:article:section")["content"] if soup.find(property="og:article:section") else ''
            tag = soup.find(property="og:article:tag")["content"] if soup.find(property="og:article:tag") else ''
            author = soup.find(property="og:article:author")["content"] if soup.find(property="og:article:author") else ''
            date = soup.find(property="og:article:published_time")["content"] if soup.find(property="og:article:published_time") else ''
            print 'date'
            print date

            body = ''
            for body_tag in soup.findAll("div", {"class" : re.compile('ArticleBody_body.*')}):
                body += body_tag.text

            # datetime.strptime (ts, "%Y") # 2012-01-02T04:32:57+0000
            dt = dateutil.parser.parse(date, fuzzy=True)
            print dt
            print url

            article.title = title.encode('utf-8')
            article.url = url.encode('utf-8')
            article.description = description.encode('utf-8')
            article.locale = locale.encode('utf-8')
            article.section = section.encode('utf-8')
            article.tag = tag.encode('utf-8')
            article.author = author.encode('utf-8')
            article.body = body.encode('utf-8')
            article.is_downloaded = True
            article.article = body
            article.save()

            print(description.encode('utf-8'))
        except (urllib2.HTTPError, ValueError) as err:
            print err
            time.sleep(20)
            failed_attempts += 1
            if failed_attempts < 10:
                continue

Any ideas?

Rob
  • Why you would expect the first one to come up with a different article each iteration, is beyond me. The second suggestion should loop through the entire queryset though. Maybe you should also fix the indentation in your post! – user2390182 Sep 30 '17 at 17:35
  • Fixing indentation now. The reason why .first() should be different is because of the lines "article.is_locked = True" and "article.save()" while the filter only gets articles with "is_locked=False" – Rob Sep 30 '17 at 17:38
  • can you provide the code of `get_article()`? – dahrens Sep 30 '17 at 17:42
  • @Rob just to confirm, are you sure that `for article in articles` is not working? Can you print the `pk` of each `article`? – Jahongir Rahmonov Sep 30 '17 at 17:44
  • DO you have the same effect if you do that loop in the shell? – user2390182 Sep 30 '17 at 17:45
  • @dahrens you were right, something about get_article() is messing things up – Rob Sep 30 '17 at 17:49
  • @Rob may you add the output of your prints for one or two iterations of the outer loop, including those from `get_article`? replace real urls with dummy data. – dahrens Sep 30 '17 at 18:18
  • please check it out....https://stackoverflow.com/questions/962619/how-to-pull-a-random-record-using-djangos-orm – Basant Rules Sep 30 '17 at 18:24

1 Answer


The way I see it you have an infinite loop in your get_article() function.

Consider this simplified version of your get_article() for illustration purposes:

def get_article(url, title, article):
    failed_attempts = 0
    # Note how this while loop runs endlessly.
    while True:
        try:
            # doing something here without calling `return` anywhere
            # I'll just write `pass` for the purpose of simplification
            pass
        except (urllib2.HTTPError, ValueError) as err:
            failed_attempts += 1
            if failed_attempts < 10:
                # you're calling `continue` here but you're not calling
                # `break` or `return` anywhere if failed_attemps >= 10
                # and therefore you're still stuck in the while-loop
                continue

Note that simply not calling continue will not stop a while loop:

while True:
    print('infinite loop!')
    if some_condition:
        # if some_condition is truthy, continue
        continue
    # but if it's not, we will continue anyway. The above if-condition
    # therefore doesn't make sense.
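For contrast, here is a minimal loop that does terminate, because it eventually reaches a break:

```python
# `continue` jumps back to the top of the loop; only `break` (or
# `return`) actually exits a `while True` loop.
attempts = 0
while True:
    attempts += 1
    if attempts < 3:
        continue   # retry: back to the top of the loop
    break          # without this line, the loop would never end
# the loop ran exactly 3 times
```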

A fixed version may look like this (details omitted):

def get_article(url, title, article):
    failed_attempts = 0
    while True:
        try:
            # it's considered good practice to only put the throwing
            # statement you want to catch in the try-block
            content = urllib2.urlopen(url).read()
        except (urllib2.HTTPError, ValueError) as err:
            failed_attempts += 1
            if failed_attempts == 10:
                # if it's the 10th attempt, break the while loop.
                # consider throwing an error here which you can handle
                # where you're calling `get_article` from. otherwise
                # the caller doesn't know something went wrong
                break
        else:
            # do your work here
            soup = BeautifulSoup(content, "html5lib")
            # ...
            article.save()
            # and call return!
            return
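Alternatively, the retry can be written with a for loop, which makes the attempt limit explicit and avoids the `while True` entirely. Here is a sketch; `fetch` is a placeholder for whatever throwing call you want to retry (the `urllib2.urlopen(url).read()` in your case):

```python
def get_with_retries(fetch, max_attempts=10):
    """Call fetch() until it succeeds or max_attempts is exhausted."""
    last_err = None
    for _ in range(max_attempts):
        try:
            return fetch()          # success: hand the result back
        except (IOError, ValueError) as err:
            last_err = err          # remember the failure and retry
    # every attempt failed: re-raise so the caller knows something
    # went wrong instead of failing silently
    raise last_err
```

The loop ends on its own after `max_attempts` failures, so there is no way to get stuck.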
olieidel
  • `failed_attempts += 1; if failed_attempts < 10:` should take care of this. – dahrens Sep 30 '17 at 18:38
  • I don't think so. How does "not calling continue" stop a while loop? I added some illustrative code in the middle (second code example) – olieidel Sep 30 '17 at 18:39
  • 2
    A `for` loop might increase readability - `break` on success, otherwise it ends after defined attempts. – dahrens Sep 30 '17 at 19:38
  • `for _ in range(num_attempts):` Yes, that's a good point and definitely prone to cause less confusion :) – olieidel Oct 01 '17 at 11:44