
I have a list of just under two million elements; each element is itself a list with 7 entries.

I run a machine learning algorithm on the data and would like to append the result of the classification to the end of each element.

I use the `.append()` method, something like:

for j in range(len(data)):
    data[j].append(results[j])

However, this takes a very long time (after 8+ hours it still had not terminated).

I'm wondering if there is a more efficient way to do this. The data is read in from a CSV file, so maybe I could write the results directly into the CSV?
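
Maybe something like this sketch is what I have in mind (the file names are made up, and I'm assuming `results` has already been computed):

import csv

# Sketch: stream each row out with its result appended, instead of
# mutating the in-memory lists. File names here are placeholders.
with open("data_in.csv", "r", newline="") as f_in, \
     open("data_out.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    for row, result in zip(csv.reader(f_in), results):
        writer.writerow(row + [result])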

I was thinking about using numpy arrays, but I recall someone saying that lists are faster.
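
If I did go the numpy route, I imagine it would look something like this (the dtype handling is a guess on my part, since the CSV values are strings):

import numpy as np

# Sketch: treat the data as an (N, 7) array and bolt the results on
# as an 8th column. Everything comes out as strings, since that is
# what csv.reader produces.
arr = np.array(data)                    # shape (N, 7)
res = np.array(results).reshape(-1, 1)  # shape (N, 1)
combined = np.hstack((arr, res))        # shape (N, 8)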

Anyone have any ideas?

EDIT: here is my code

import csv

with open("measles_data_b", 'r') as f:
    reader = csv.reader(f)
    t = list(reader)

# Perform the machine learning. That bit works fine.
# At this point, t is a list with size=1971203, and each element in t
# has 7 elements of its own.
# results is a list with the same number of elements. Its entries are
# one of three things: '1', '2', '0'.

for j in range(len(t)):
    t[j].append(results[j])
user3600497
  • Have you considered creating each element with 8 entries to start with, so you can just do `data[j][7] = results[j]` and thus avoid resizing each list? (See the sketch after this comment thread.) – Greg Hewgill Jul 12 '15 at 22:21
  • I had not. I'm not sure how I would do that, since the data is read in from a csv. Any tips? – user3600497 Jul 12 '15 at 22:24
  • You could profile your code, and figure out how much time is spent reading from CSV, how much time is spent appending, and how much time is spent writing back to CSV. – Dean J Jul 12 '15 at 22:25
  • Can you share the code for the current data read, so we can just help you modify that? – Dean J Jul 12 '15 at 22:26
  • Sure. Let me just edit my post. – user3600497 Jul 12 '15 at 22:28
  • Are you absolutely certain that it is this mere `append` loop that takes 8+ hours? I just ran the loop you posted with two million element lists and it took two seconds or so, which is exactly what I'd expect: appending is cheap, very cheap. I find it hard to believe that this time isn't being spent in the rest of the code. – Jul 12 '15 at 22:29
  • I ran it from 9am to 5pm and it had not terminated. I used some `print` statements to show when the learning finishes and when the csv is read into a list etc. When I terminate the code from the command line (I've tried to run this code several times), it always says that it terminates in the loop where I append the results. – user3600497 Jul 12 '15 at 22:33
  • Which version of python are you using? But overall, I agree with @delnan; `timeit for i in lists: i.append(7)` with `lists` being 2 million lists got me 157ms. – NightShadeQueen Jul 12 '15 at 22:43
  • Please double and triple check. If that's your whole code, I bet three to one that it's not hanging because the append loop takes so long. (There was a [garbage collection bug in old Python versions that would cause list.append to be extremely slow](http://stackoverflow.com/q/2473783/395760), but far from taking *hours* for two million elements. I'm not even sure it applies here, since the actual appending happens on 7-element lists.) – Jul 12 '15 at 22:43
  • I used a list comprehension to append a list of 2 million randomly generated elements onto the ends of 2 million randomly generated lists. Generating the lists took about 15 seconds with all the random number generation; evaluating the list comprehension took less than a second. I agree with others that something else is going on here and append isn't the culprit. – John Coleman Jul 12 '15 at 22:59
  • A good way to check that it's definitely the append portion of your code is to save the analysis output of the machine learning algorithm to disk and exit. Then have another program load both the original data and the analysis output and combine them. That way you know for sure what is slow. – dermen Jul 12 '15 at 23:58
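
A sketch of the pre-allocation idea from the first comment, assuming the rows come straight from `csv.reader` and that `results` is produced in between, as in the question:

import csv

# Sketch of the pre-allocation suggestion: give each row an 8th slot
# up front, then assign into it rather than append.
with open("measles_data_b", 'r') as f:
    t = [row + [None] for row in csv.reader(f)]

# ... machine learning runs here, producing `results` ...

for j in range(len(t)):
    t[j][7] = results[j]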

1 Answer


As an experiment, run the following code:

import random

def append_items(lists, items):
    for i in range(len(lists)):
        lists[i].append(items[i])

rand_lists = [[random.randint(0,9) for i in range(7)] for j in range(2000000)]
rand_list = [random.randint(0,9) for i in range(2000000)]

print("Lists generated")
append_items(rand_lists,rand_list)
print("Lists appended")

When I run it, I have to wait 20-30 seconds to see "Lists generated" printed, but the next print is almost instantaneous. If you don't get this sort of behavior, then you have a buggy Python installation. If you do, it's hard to say what is happening in your code. It might be interesting to look at `type(t[0])`: perhaps you have a list of list-like objects rather than a list of lists, and your list-like objects implement an inefficient `append` method. (I haven't used it, but it seems at least possible that `csv.reader` returns some sort of custom object.)
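
If the experiment behaves as expected, one way to pin down where your own script spends its time might be to time the suspect loop in isolation, with `t` and `results` as in your question:

import time

# Time just the append loop, nothing else.
start = time.perf_counter()
for j in range(len(t)):
    t[j].append(results[j])
print("append loop took {:.2f} s".format(time.perf_counter() - start))
print(type(t[0]))  # rows from csv.reader should be plain lists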

John Coleman