I'm trying to write items from a list to several files, naming each file after its date. I know I shouldn't use regular expressions to scrape HTML, but for the time being it serves me well. Excuse my ignorance; I'm a beginner. This scraping is for academic purposes only. Thank you in advance.

    from urllib import urlopen
    import re

    webpage = urlopen('x').read()
    date = re.compile('[0-9]{2}-[a-zA-Z]{3}-[0-9]{4}')
    article =  re.compile('<span>.*<div>', re.DOTALL)
    findDate = re.findall(patFinderDate,webpage)
    findArticle = re.findall(patFinderArticle,webpage)

    listIterator = []
    listIterator[:] = range(0,1000)

    for i in listIterator:
        filename = findDate[i]
        with open(filename,"w") as f:
            f.write(i)
            f.close()
  • You can just do `for i in range(0,1000)` (or even `for i in range(1000)`) ... No need for `listIterator` here. – mgilson Sep 19 '12 at 17:55
  • what's wrong with using regular expressions to scrape html? – Hans Then Sep 19 '12 at 17:55
  • Also, you should be more explicit about what your problem actually is ... What is this doing? What should it be doing? – mgilson Sep 19 '12 at 17:55
  • @HansThen: html is more powerful than a regular language (the ones that regular expressions match), thus no matter how clever your regexes are, some valid HTML will break them – Claudiu Sep 19 '12 at 17:56
  • @HansThen -- I hope you're joking. If not, read [this](http://stackoverflow.com/a/1732454/748858) – mgilson Sep 19 '12 at 17:56
  • I was speaking tongue in cheek. However, while it is true that regexen cannot _parse_ html, for most practical purposes a simple regex will extract your data just fine. – Hans Then Sep 19 '12 at 17:59
  • Sorry, I should have specified. The error that comes up is: `filename = findPatDate[i]` `TypeError: list indices must be integers, not str` – R. Kualki Sep 19 '12 at 18:02
  • I would like files to be saved with each item in the list (date and article), the title being the date. – R. Kualki Sep 19 '12 at 18:04
  • Your code doesn't seem complete. Is `date` and `patFinderDate` the same thing? Ditto for `article` vs `patFinderArticle`. – tripleee Sep 19 '12 at 18:17
  • Also consider using http://pypi.python.org/pypi/mechanize/ instead of regex. – Drahkar Sep 19 '12 at 18:31

1 Answer

If you are sure you have as many dates as articles, you can rewrite your code roughly as follows:

    from urllib import urlopen
    import re

    webpage = urlopen('x').read()
    date_p = re.compile('[0-9]{2}-[a-zA-Z]{3}-[0-9]{4}')
    article_p = re.compile('<span>.*<div>', re.DOTALL)
    allDates = re.findall(date_p, webpage)
    allArticles = re.findall(article_p, webpage)

    for date, article in zip(allDates, allArticles):
        with open(date, "w") as f:
            f.write(article)

The zip() function "zips" the two iterables together so that each iteration gets a 2-tuple of (date, article). It stops at the end of the shorter iterable, which is why you need to check that there are as many dates as articles.
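The length check mentioned above could be made explicit, roughly like this. This is only a sketch in Python 3 syntax (the code above is Python 2); the function name `write_articles`, the `.txt` suffix, the temporary output directory, and the sample data are all assumptions for illustration, not part of the original code.

```python
import os
import tempfile


def write_articles(dates, articles, out_dir):
    """Write each article to a file named after its date; return the paths."""
    # zip() stops at the shorter input, so a mismatch would silently drop items.
    if len(dates) != len(articles):
        raise ValueError(
            "got %d dates but %d articles" % (len(dates), len(articles))
        )
    paths = []
    for date, article in zip(dates, articles):
        path = os.path.join(out_dir, date + ".txt")
        # The with block closes the file, so no explicit close() is needed.
        with open(path, "w") as f:
            f.write(article)
        paths.append(path)
    return paths


# Hypothetical sample data standing in for allDates and allArticles.
sample_dates = ["19-Sep-2012", "20-Sep-2012"]
sample_articles = ["<span>first<div>", "<span>second<div>"]
out_dir = tempfile.mkdtemp()
written = write_articles(sample_dates, sample_articles, out_dir)
```

Raising an error on a mismatch is one design choice; you could instead log a warning and write only the pairs that zip() produces.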

Hans Then
  • Thank you very much for the response, it was very helpful. However, when I execute it only one file is created, with one specific date. If I leave the console running and delete that file, another one appears with the same date but a different article. Any ideas on what's going wrong? Every article has only one date, no duplicates. Thank you in advance – R. Kualki Sep 20 '12 at 14:28
  • You might try looking at the dates in `allDates`. E.g. `for date in allDates: print date`. If all the dates are the same, maybe the dates in your html are also all the same. – Hans Then Sep 20 '12 at 20:19
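
Besides printing the dates, it may help to compare how many dates and articles were actually found. One possible cause of getting a single file, offered as a guess rather than a diagnosis of the real page, is that `.*` is greedy: with `re.DOTALL` it runs from the first `<span>` to the last `<div>`, so `findall` returns one big match and zip() produces only one pair. The sketch below (Python 3 syntax) uses a made-up `sample_page` and a hypothetical non-greedy pattern `lazy_p` to show the difference, and also lists any date that occurs more than once, since repeated dates would overwrite the same file.

```python
import re
from collections import Counter

# Hypothetical page with two dated articles (not the real site).
sample_page = (
    "01-Jan-2012 <span>first article<div> "
    "02-Jan-2012 <span>second article<div>"
)

date_p = re.compile(r"[0-9]{2}-[a-zA-Z]{3}-[0-9]{4}")
greedy_p = re.compile(r"<span>.*<div>", re.DOTALL)   # pattern from the answer
lazy_p = re.compile(r"<span>.*?<div>", re.DOTALL)    # non-greedy variant

all_dates = date_p.findall(sample_page)
greedy_articles = greedy_p.findall(sample_page)
lazy_articles = lazy_p.findall(sample_page)

# Dates that occur more than once would overwrite each other's files.
repeated = [d for d, n in Counter(all_dates).items() if n > 1]

print(len(all_dates), len(greedy_articles), len(lazy_articles))
print(repeated)
```

If the counts differ on the real page, the pairing done by zip() is not the one you expect, whatever the underlying reason turns out to be.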