3

I have a list of approximately 200 000+ objects, each one representing a file (but not actually holding the file's contents, just the full path name and date).

The program I am writing copies any subset of these files, depending on the user-provided date range. I first create a list of all of the files in the source directory (with the glob module), create an instance of my file-representation class and add that instance to a list, like so:

for f in glob.glob(srcdir + "/*.txt"):
    LOG_FILES.append(LogFile(f))

Now, to keep the copying of files quick and the block of code clean, I remove the LogFile objects that do not fit inside of the date range.

for i in xrange(0, len(LOG_FILES)):
    if LOG_FILES[i].DATE < from_date or LOG_FILES[i].DATE > to_date:
        del(LOG_FILES[i])

Afterwards, I can just copy the files that are left in the list:

for logfile in LOG_FILES:
    os.copy(logfile.PATH, destdir)

The issue occurs with the for i in xrange... example: I get thrown an IndexError when the value of i gets to 63792.

IndexError: list index out of range.

Any ideas?

EDIT Thank you very much for the quick responses! Now that I think about it, it was a silly oversight on my part. Again, thank you, everyone. :)

nesv
  • 786
  • 6
  • 10
  • 2
    you are modifying the sequence you are iterating over. – Corey Goldberg Jan 20 '11 at 21:42
  • Date ranges are easily queryable in SQL. If the application is meant for distribution to users perhaps a database would be best. Just an idea. – krs1 Jan 20 '11 at 21:44
  • Thanks krs, but it's not meant for distribution - it just grabs sets of log files from an ancient, archaic system we are running (which I will eventually be re-writing) that creates logs on particular events. – nesv Jan 20 '11 at 22:28
  • I posted two rectifications concerning your problem – eyquem Jan 20 '11 at 23:52
  • @eyquem: Thanks for that - I modified Cpfohl's example when I implemented it; I didn't use it directly. I would post what I did, but I do not have the code handy to post what worked. – nesv Jan 24 '11 at 03:22

7 Answers7

7

From the docs:

It is not safe to modify the sequence being iterated over in the loop (this can only happen for mutable sequence types, such as lists). If you need to modify the list you are iterating over (for example, to duplicate selected items) you must iterate over a copy.

For your case, I'd actually suggest looking into using generator expressions and itertools.ifilter, to avoid making unnecessary copies of your big list of files.

Will McCutchen
  • 13,047
  • 3
  • 44
  • 43
3

The problem with your method is that del() is removing the entry in the list at that index and re-ordering the list.

For example, if you have five items in a list and call del() on the third index, the contents of the list are shifted down so that a different element takes the third index.

list = [1,2,3,4,5]
del(list[2])
print list     # outputs [1, 2, 4, 5]
print list[2]  # outputs 4

Since you are looping from 0 to the original size of the list, even if you have removed just one item from the list, you will eventually arrive at indices which are no longer contained in the list.

A much simpler approach would be to filter the list as you add items to it.

for f in glob.glob(srcdir + "/*.txt"):
    lf = LogFile(f)
    if lf.DATE < from_date and lf.DATE > to_date:
        LOG_FILES.append(lf)

This could probably be made more pythonic but should be readable enough to get the point across.

matt b
  • 138,234
  • 66
  • 282
  • 345
2

[EDIT] Oops, I forgot to invert the "<" and ">" and add an 'equals' sign.

LOG_FILES = [LogFile(f) for f in glob.glob(srcdir + "/*.txt")
                        if from_date <= f.DATE <= to_date]

This can replace the whole initalization of LOG_FILES. It's a list comprehension (if you wish you can make it a generator (which doesn't get evaluated until it's enumerated) by replacing the [ ] with ( ). That might be more efficient depending on what you do with it.

You need to do this because editing a collection while enumerating it isn't allowed. (see above, far more eloquent answers).

You can read the expression above like this:

"create a list (or enumerable) of the result of LogFile, when it's handed 'f' for each f in 'glob.glob(...)' but only if the 'if' statement is true."

See: The List Comprehension section of that link.

Community
  • 1
  • 1
Chris Pfohl
  • 18,220
  • 9
  • 68
  • 111
1

If you're looping over an array with a fixed upper limit and deleting elements at the same time you will generate index errors. Either you must loop over a copy or use a dynamic index. Since you stated the array is big we use the latter:

limit, i = len(LOG_FILES), 0
while i < limit:
    if LOG_FILES[i].DATE < from_date and LOG_FILES[i].DATE > to_date:
        del(LOG_FILES[i])
        limit -= 1
    else:
        i += 1
orlp
  • 112,504
  • 36
  • 218
  • 315
1

You could also use filter:

LOG_FILES = filter(lambda log_file: log_file.DATE < from_date and \
                                    log_file.DATE > to_date, LOG_FILES)
Dan Breen
  • 12,626
  • 4
  • 38
  • 49
1

There's a problem in the Cpfohl's answer:

LOG_FILES = [LogFile(f) for f in glob.glob(srcdir + "/*.txt")
             if f.DATE >= from_date and f.DATE <= to_date]

Since

for f in glob.glob(srcdir + "/*.txt"):
    LOG_FILES.append(LogFile(f))

therefore a LOG_FILES[i] is a LogFile(f) and then a LOG_FILES[i].DATE is a LogFile(f).DATE, not a f.DATE

eyquem
  • 26,771
  • 7
  • 38
  • 46
0

1) deleting elements during the iteration in a list from end to beginning of the list dissolve problems

LOG_FILES = [ 1,2,30,2,5,8,30,3,2,37,22,30,27,30,4 ]

print LOG_FILES

L = len(LOG_FILES)-1
for i,x in enumerate(LOG_FILES[::-1]):
    print i,L-i,' ',LOG_FILES[L-i],x
    if x>15:
        del LOG_FILES[L-i]

print LOG_FILES

result

[1, 2, 30, 2, 5, 8, 30, 3, 2, 37, 22, 30, 27, 30, 4]
0 14   4 4
1 13   30 30
2 12   27 27
3 11   30 30
4 10   22 22
5 9   37 37
6 8   2 2
7 7   3 3
8 6   30 30
9 5   8 8
10 4   5 5
11 3   2 2
12 2   30 30
13 1   2 2
14 0   1 1
[1, 2, 2, 5, 8, 3, 2, 4]

2) By the way

if LOG_FILES[i].DATE < to_date and LOG_FILES[i].DATE > from_date :

can be written

if from_date  < LOG_FILES[i].DATE < to_date:
eyquem
  • 26,771
  • 7
  • 38
  • 46