
I am reading in numbers from a file and casting them to floats. The numbers look like this.

1326.617827, 1322.954823, 1320.512821, 1319.291819...

I split each line at commas and then create the list of floats through a list comprehension.

import time

def listFromLine(line):
    t = time.clock()
    temp_line = line.split(',')
    print "line operations: " + str(time.clock() - t)
    t = time.clock()
    ret = [float(i) for i in temp_line]
    print "float comprehension: " + str(time.clock() - t)
    return ret

The output looks something like this:

line operations: 5.52103727549e-05
float comprehension: 0.00121321255003
line operations: 9.52025017378e-05
float comprehension: 0.000943885026522
line operations: 7.0782529173e-05
float comprehension: 0.000946716327689

Casting to an int and then dividing by 1.0 is a lot faster, but is useless in my case as I need to keep the numbers after the decimal point.

I saw this question and had a go at using pandas.Series, but that was slower than what I was doing previously.

In[38]: timeit("[float(i) for i in line[1:-2].split(',')]", "f=open('pathtofile');line=f.readline()", number=100)
Out[38]: 0.10676022701363763
In[39]: timeit("pandas.Series(line[1:-2].split(',')).apply(lambda x: float(x))", "import pandas;f=open('pathtofile');line=f.readline()", number=100)
Out[39]: 0.14640622942852133

Changing the format of the file may be an option if that could speed it up, but speeding things up at the reading end would be preferable.


2 Answers


You're going to want to use numpy to create an array of floats using loadtxt. http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html

Something like:

import numpy
array = numpy.loadtxt('/path/to/data.file', dtype=float, delimiter=',')

If that doesn't work because of the spaces, you might want to try genfromtxt with the 'autostrip' option: http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
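As a minimal sketch of that suggestion (the sample values here are stand-ins for the file contents, read from an in-memory buffer rather than a real path):

```python
import io
import numpy

# In-memory stand-in for the data file; values mimic the question's format,
# including the space after each comma.
data = io.StringIO(u"1326.617827, 1322.954823, 1320.512821\n"
                   u"1319.291819, 1318.123456, 1317.654321\n")

# autostrip removes leading/trailing whitespace from each field before
# conversion, so " 1322.954823" parses cleanly as a float.
arr = numpy.genfromtxt(data, delimiter=',', autostrip=True)
print(arr.shape)  # one row per line, one column per value
```

With a real file you would pass the path instead of the `StringIO` buffer; everything else stays the same.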

This is considerably faster than splitting/converting manually or with a csv reader.


First of all, to get rid of splitting the lines yourself, you can use the csv module to read the file. It reads the file with a specified delimiter and returns a reader object that iterates over the lines, each already split on commas:

>>> import csv
>>> with open('filename', newline='') as csvfile:
...     spamreader = csv.reader(csvfile, delimiter=',')
...     for row in spamreader:
...         # do stuff

Then, to convert your numbers to float: since you are applying the built-in function float to each item, you are better off using map, which performs better than a list comprehension in this case.

So for each line (a row, when you read with csv) you can do:

...     for row in spamreader:
             numbers=map(float,row)

Also, regarding pandas and its performance: tools like pandas (or NumPy) perform better when you are dealing with a huge set of data, not a small one, because for small sets the cost of converting Python types to C types outweighs the advantage of the faster computation. For more info, read this question and the complete answer: Why list comprehension is much faster than numpy for multiplying arrays?
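To illustrate the map-vs-comprehension point, a quick timing sketch (the line below is a stand-in for one line of the file; absolute timings will vary by machine):

```python
import timeit

line = "1326.617827, 1322.954823, 1320.512821, 1319.291819"
fields = line.split(',')

# map hands each field straight to the C-level float(); the list
# comprehension pays for an extra bytecode loop iteration per element.
t_map = timeit.timeit(lambda: list(map(float, fields)), number=100000)
t_comp = timeit.timeit(lambda: [float(i) for i in fields], number=100000)
print(t_map, t_comp)
```

Both produce the same list of floats; only the per-element overhead differs.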
