Writing items from list to several files - Python

Question

I'm trying to write items from a list to several files. I would like to name each file according to its date. Please bear in mind I know I shouldn't use regular expressions to scrape HTML but for the time being it serves me well. Excuse the ignorance but I'm a beginner. This scraping is only for academic purposes. Thank you in advance.

    from urllib import urlopen
    import re

    webpage = urlopen('x').read()
    date = re.compile('[0-9]{2}-[a-zA-Z]{3}-[0-9]{4}')
    article =  re.compile('<span>.*<div>', re.DOTALL)
    findDate = re.findall(patFinderDate,webpage)
    findArticle = re.findall(patFinderArticle,webpage)

    listIterator = []
    listIterator[:] = range(0,1000)

    for i in listIterator:
        filename = findDate[i]
        with open(filename,"w") as f:
            f.write(i)
            f.close()

You can just do `for i in range(0,1000)` (or even `for i in range(1000)`) ... No need for `listIterator` here. — mgilson, Sep 19 '12 at 17:55
Also, you should be more explicit about what your problem actually is ... What is this doing? What should it be doing? — mgilson, Sep 19 '12 at 17:55
@HansThen: html is more powerful than a regular language (the ones that regular expressions match), thus no matter how clever your regexes are, some valid HTML will break them — Claudiu, Sep 19 '12 at 17:56
@HansThen -- I hope you're joking. If not, read [this](http://stackoverflow.com/a/1732454/748858) — mgilson, Sep 19 '12 at 17:56
I was speaking tongue in cheek. However, while it is true that regexen cannot _parse_ html, for most practical purposes a simple regex will extract your data just fine. — Hans Then, Sep 19 '12 at 17:59
Sorry, I should have specified. The error that comes up is: filename = findPatDate[i] TypeError: list indices must be integers, not str — R. Kualki, Sep 19 '12 at 18:02
I would like files to be saved with each item in the list (date and article), the title being the date. — R. Kualki, Sep 19 '12 at 18:04
Your code doesn't seem complete. Is `date` and `patFinderDate` the same thing? Ditto for `article` vs `patFinderArticle`. — tripleee, Sep 19 '12 at 18:17
Also consider using http://pypi.python.org/pypi/mechanize/ instead of regex. — Drahkar, Sep 19 '12 at 18:31

Hans Then · Answer 1 · 2012-09-19T22:30:45.297

1

If you are sure you have as many dates as articles, you can rewrite your code roughly as follows:

    from urllib import urlopen
    import re

    webpage = urlopen('x').read()
    date_p = re.compile('[0-9]{2}-[a-zA-Z]{3}-[0-9]{4}')
    article_p = re.compile('<span>.*<div>', re.DOTALL)
    allDates = re.findall(date_p, webpage)
    allArticles = re.findall(article_p, webpage)

    for date, article in zip(allDates, allArticles):
        with open(date, "w") as f:
            f.write(article)

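The zip() function "zips" the two lists together, so the loop sees one (date, article) 2-tuple per iteration. It stops as soon as the shorter list runs out, so any extra dates or articles are silently dropped. That's the reason you need to check that there are as many dates as articles.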
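To make the length caveat concrete, here is a minimal sketch of the truncation and of one way to guard against it. It is written for Python 3, whereas the code above is Python 2. The sample lists, the `write_articles` helper, the `out_dir` parameter and the `.txt` suffix are all assumptions made for illustration; none of them come from the original code.

```python
import os
import tempfile

# Hypothetical sample data standing in for the two re.findall() results.
all_dates = ["01-Jan-2012", "02-Jan-2012", "03-Jan-2012"]
all_articles = ["first article", "second article"]  # one article short

# zip() stops at the shorter input, so the third date is silently dropped.
pairs = list(zip(all_dates, all_articles))


def write_articles(dates, articles, out_dir):
    """Write each article to a file named after its date.

    Hypothetical helper: raises ValueError instead of silently
    dropping items when the two lists differ in length.
    """
    if len(dates) != len(articles):
        raise ValueError(
            "got %d dates but %d articles" % (len(dates), len(articles))
        )
    written = []
    for date, article in zip(dates, articles):
        path = os.path.join(out_dir, date + ".txt")
        with open(path, "w") as f:
            f.write(article)
        written.append(path)
    return written
```

Calling `write_articles(all_dates, all_articles, tempfile.mkdtemp())` with the sample data raises `ValueError`, which is easier to notice than a missing file.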

edited Sep 19 '12 at 22:30

answered Sep 19 '12 at 18:31


Thank you very much for the response, it was very helpful. However, when I execute it, only one file is created, with one specific date. If I leave the console running and delete that file, another one appears with the same date but a different article. Any ideas on what's going wrong? Every article has only one date, with no duplicates. Thank you in advance – R. Kualki Sep 20 '12 at 14:28
You might try looking at the dates in `allDates`. E.g. `for date in allDates: print date`. If all the dates are the same, maybe the dates in your html are also all the same. – Hans Then Sep 20 '12 at 20:19
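A slightly fuller version of that check might look like the sketch below (Python 3). It counts the matches from each pattern and lists any date that occurs more than once, since two articles with the same date would be written to the same filename and the later one would overwrite the earlier. It also compares the greedy `.*` from the answer with a non-greedy `.*?`, because a greedy match under `re.DOTALL` runs from the first `<span>` to the last `<div>` and so yields a single match. Whether that is what happens on the real page is an assumption; the `sample` string is hypothetical, as the actual markup is not shown.

```python
import re
from collections import Counter

# Hypothetical page content; the real markup is not shown in the question.
sample = (
    "<span>01-Jan-2012 first article<div>"
    "<span>02-Jan-2012 second article<div>"
)

date_p = re.compile(r"[0-9]{2}-[a-zA-Z]{3}-[0-9]{4}")
greedy_p = re.compile(r"<span>.*<div>", re.DOTALL)   # pattern from the answer
lazy_p = re.compile(r"<span>.*?<div>", re.DOTALL)    # non-greedy variant

all_dates = date_p.findall(sample)
greedy_articles = greedy_p.findall(sample)
lazy_articles = lazy_p.findall(sample)

# Dates that occur more than once would map to the same filename,
# and opening that file in "w" mode would overwrite the earlier article.
duplicates = [d for d, n in Counter(all_dates).items() if n > 1]

print("dates:", len(all_dates))
print("articles (greedy):", len(greedy_articles))
print("articles (non-greedy):", len(lazy_articles))
print("duplicate dates:", duplicates)
```

If the number of articles is lower than the number of dates, `zip()` will produce fewer files than expected, regardless of how many distinct dates were found.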
