Consider a large list of column names (the first line) read from a large CSV file (~80 MB), where some of the columns may be blank:

    name_line = ['a', '', 'b', '', 'c', ..., '', 'cb', 'cc']

I am reading the rest of the file in line by line, and I only need to process data that has a corresponding name. A data line might look like:

    data_line = ['10', '', '.5', '', '10289', ..., '', '16.7', '0']
I have tried it two ways. One is popping the empty columns from each line as it is read:

    blnk_cols = [1, 3, ..., 97]
    while data:
        ...
        # pop from the highest index down so earlier pops
        # don't shift the positions of the later blanks
        for index in reversed(blnk_cols):
            data_line.pop(index)
The other is keeping only the items that have a corresponding name in the first line:

    good_cols = [0, 2, 4, ..., 98, 99]
    while data:
        ...
        # build a new list containing only the named columns
        data_line = [data_line[index] for index in good_cols]
In the data I am using there will definitely be more good columns than bad ones, although it could be as close as half and half.
I used the cProfile and pstats modules to find the weakest links in speed, and they pointed to the pop as the slowest step. I switched to the list comprehension and the time almost doubled.
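For reference, this is roughly how I profiled it (the entry point parse_file and the file names here are placeholders, not my real code):

    import cProfile
    import pstats

    # profile one full parse and dump the stats to a file
    cProfile.run('parse_file("data.csv")', 'parse.prof')

    # print the ten most expensive calls by cumulative time
    pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(10)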
I imagine one fast way would be to slice the list so that only the good data is retrieved, but this gets complicated for files where the blank and good columns are mixed irregularly.
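For the strictly alternating case a stride slice would do it; a sketch:

    # only valid when every other column is good, as in the example above
    data_line = data_line[::2]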
What I really need is to be able to do

    data_line = data_line[good_cols]

effectively passing a list of indices into a list to get back those items. Right now my program runs in about 2.3 seconds for a 10 MB file, and the pop accounts for about 0.3 seconds of that.

Is there a faster way to access certain locations in a list? In C it would just be dereferencing an array of pointers to the correct indices in the array.
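For what it's worth, the closest equivalents I know of are operator.itemgetter and NumPy fancy indexing; a minimal sketch of both (whether either actually beats the list comprehension is exactly what I am asking):

    from operator import itemgetter
    import numpy as np

    # option 1: itemgetter builds a callable that picks out the given
    # indices and returns them as a tuple
    get_good = itemgetter(*good_cols)
    subset = list(get_good(data_line))

    # option 2: a NumPy array can be indexed with a list of indices
    arr = np.array(data_line)
    subset = arr[good_cols].tolist()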
Addition: the name line in the file before the read:

    a,b,c,d,e,f,g,,,,,h,i,j,k,,,,l,m,n,

name_line after the read and split(","):

    ['a','b','c','d','e','f','g','','','','','h','i','j','k','','','','l','m','n','\n']
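Putting it together, this is roughly the structure of my read loop (the file name and the processing step are placeholders):

    # 'data.csv' stands in for the real ~80 MB file
    with open('data.csv') as f:
        name_line = f.readline().rstrip('\n').split(',')

        # good columns are the ones whose name is non-empty
        good_cols = [i for i, name in enumerate(name_line) if name]

        for line in f:
            data_line = line.rstrip('\n').split(',')
            data_line = [data_line[i] for i in good_cols]
            # ... process the named columns ...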