
Consider a large list of named items (the first line) read from a large CSV file (80 MB), with empty columns possibly interspersed among the names

name_line = ['a', '', 'b', '', 'c', ... , '', 'cb', 'cc']

I am reading the remainder of the data in line by line, and I only need to process the columns that have a corresponding name. A data line might look like

data_line = ['10', '', '.5', '', '10289', ... , '', '16.7', '0']

I tried it two ways. One is popping the empty columns from each line as it is read

blnk_cols = [1, 3, ... , 97]
while data:
    ...
    # pop from the highest index down, so earlier pops don't shift the later indices
    for index in reversed(blnk_cols):
        data_line.pop(index)

the other is keeping only the items whose column has a name in the first line

good_cols = [0,2,4, ... ,98,99]   
while data:
    ...
    data_line = [data_line[index] for index in good_cols]

In the data I am using there will definitely be more good columns than blank columns, although it might be as close as half and half.

I used the cProfile and pstats modules to find the weakest links in speed, and they suggested the pop was the slowest call. I switched to the list comprehension and the time almost doubled.
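
A quick micro-benchmark with timeit can confirm which approach wins; this is only a sketch using made-up 100-column rows, half of them blank, so the numbers will differ on real data:

import timeit

# made-up rows: 100 columns, good data at even indices, blanks at odd ones
setup = """
data = ['x'] * 100
good = list(range(0, 100, 2))
blank = list(range(99, 0, -2))   # descending so pops don't shift later indices
"""
# pop mutates the row, so each timed run copies it first
t_pop = timeit.timeit("row = list(data)\nfor i in blank: row.pop(i)", setup=setup, number=100000)
t_comp = timeit.timeit("[data[i] for i in good]", setup=setup, number=100000)
print("pop: %.3f s   listcomp: %.3f s" % (t_pop, t_comp))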

I imagine one fast way would be to slice the list, retrieving only the good data (for a strictly alternating layout a plain slice does work, as in the sketch below), but this gets complicated when blank and good columns are interleaved irregularly.
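
A one-line sketch with sample values, assuming the good columns sit at every even index:

# works only when good data is at every other position, starting at index 0
data_line = ['10', '', '.5', '', '10289', '']
good = data_line[0::2]   # -> ['10', '.5', '10289']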

What I really need is to be able to do

data_line = data_line[good_cols]

effectively passing a list of indices into a list to get back those items. Right now my program runs in about 2.3 seconds for a 10 MB file, and the pop accounts for about 0.3 seconds.
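
Plain lists have no such indexing, but operator.itemgetter from the standard library expresses exactly that multi-index lookup and does the fetching in a single call at C speed (in CPython). A minimal sketch, with sample indices and values only; a numpy array would support data_line[good_cols] directly, at the cost of converting each row:

from operator import itemgetter

good_cols = [0, 2, 4]                  # sample; the real list comes from name_line
get_good = itemgetter(*good_cols)      # build the getter once, outside the read loop

data_line = ['10', '', '.5', '', '10289', '']
data_line = list(get_good(data_line))  # -> ['10', '.5', '10289']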

Is there a faster way to access certain locations in a list? In C it would just be dereferencing an array of pointers to the correct indices in the array.

Additions: name_line in the file, before the read

a,b,c,d,e,f,g,,,,,h,i,j,k,,,,l,m,n,

name_line after read and split(",")

['a','b','c','d','e','f','g','','','','','h','i','j','k','','','','l','m','n','\n']
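
One way to build good_cols from that split line (a sketch, assuming a column counts as good whenever its stripped name is non-empty):

name_line = ['a','b','c','d','e','f','g','','','','','h','i','j','k','','','','l','m','n','\n']
# keep only the columns whose header is a real name (strip() also drops the trailing '\n')
good_cols = [i for i, name in enumerate(name_line) if name.strip()]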
Paul Seeb
  • What are you doing with data_line? Are you merely iterating it? Are you putting it into another datastructure? – Marcin Jan 25 '12 at 19:13
  • Also, have you tried a generator? – Marcin Jan 25 '12 at 19:13
  • "Consider a large list returned from a large csv file "? Are you reading the **entire** file into one list? Why? Why not process each line individually? – S.Lott Jan 25 '12 at 19:16
  • the file I am reading is a higher-frequency file (i.e. 10 Hz). I am reading in the lines, accumulating and averaging all the values in each x-second interval, and writing the result back into a file. E.g. going from 10 Hz to 1 Hz would accumulate 10 data values (from 0 to 1 seconds), average them, and output the single data line into a file for the floor(time) of the averaged range – Paul Seeb Jan 25 '12 at 19:18
  • I am processing each line individually. Edited that for clarity – Paul Seeb Jan 25 '12 at 19:19
  • Can you provide a more accurate example of the `name_line` list and a few examples of `data_line`? I'm wondering if the `name_line` list really looks like `['a','','b','','c' .... ,'','cb','cc']` i.e. with empty strings where you have double commas. – sgallen Jan 25 '12 at 19:56
  • Does "I tried it two ways. One is popping the empty lines from each line of the read". Mean "popping the empty **columns** from each line"? If so, you might want to edit your question to use **column** when you mean column. Why are you removing columns in the first place? If it's slow, why do it? – S.Lott Jan 25 '12 at 21:15
  • Are you using `split(",")` to parse a CSV file? What's wrong with the `csv` module which handles much of this for you? – S.Lott Jan 25 '12 at 21:16
  • I was removing columns as it improved readability and simplified the code. Additionally I needed to write without the extra columns. If I use the indexing and load only the values I need into an array I imagine that might result in the speed bump I am looking for. I used split because each line is comma separated and the function looked like the perfect one for the job. Does using the csv module produce any speed benefit or return a better list than the split function? – Paul Seeb Jan 25 '12 at 22:57
  • Popping the columns appears to be faster than the indexing solution (at least using a generator) as creating the generator seems to be rather time intensive. – Paul Seeb Jan 26 '12 at 20:53
  • Replaced the generator with the corresponding for loop and got the benefit I expected (removing the pop time and giving a slight boost) for a total reduction of about .4 seconds – Paul Seeb Jan 26 '12 at 21:07

1 Answer


Try a generator expression,

data_line = (data_line[i] for i in good_cols)

Also read here about Generator Expressions vs. List Comprehension

as the top answer tells you: 'Basically, use a generator expression if all you're doing is iterating once'.

So you should benefit from this.
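
For example, a per-line loop could look like the sketch below; the filename and the accumulate step are placeholders, and the csv module mentioned in the comments does the splitting for you:

import csv

with open("data.csv", newline="") as f:   # hypothetical filename
    reader = csv.reader(f)
    name_line = next(reader)
    good_cols = [i for i, name in enumerate(name_line) if name.strip()]
    for data_line in reader:
        # the generator lazily yields only the named columns, no intermediate list
        values = (float(data_line[i]) for i in good_cols)
        total = sum(values)   # stand-in for the real accumulate/average step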

Johan Lundberg
  • Which is faster rather depends on what you're doing with it. The advantage of a generator is that it's lazy, so you don't allocate a lot of memory for items which you access just once. – Marcin Jan 25 '12 at 19:25
  • @Marcin. Yes, clarified my answer. – Johan Lundberg Jan 25 '12 at 19:32
  • Refactored all of my code to fit generator expressions. I go through each data line once to process (using a generator with appropriate indexing instead of popping the blank values initially). The code runs about .3 seconds slower because I need to recreate the generator expression for each data line. – Paul Seeb Jan 26 '12 at 20:52
  • @PaulSeeb I'm confused. *creating* the generator expression should not take any time. – Johan Lundberg Jan 26 '12 at 20:56
  • there are 25000 lines in this file. I need to make a new generator for each line to process all the data in the line unless I can "reset" the generator for each line. I did some research on that and found that was impossible. – Paul Seeb Jan 26 '12 at 21:06