
I am building a new function to import data from a file. I'm writing my own so that it can use the same general function call as loadtxt() and also deal with headers for the data columns. The issue is the size of the data files; the last one I worked with was 1.3 GB. To minimize RAM usage, I was planning to load the file into a variable, break it up into an array "f", and then deal with 50,000 lines at a time. That way I could put those 50,000 processed lines into an array and then delete them from the variable holding the original file. (Processing and then deleting one line at a time takes too long, hence the idea to do 50,000.)

For the processing I am using:

import numpy as np

def processing(arr, delimiter, dtype):
    return map(dtype, arr.split(delimiter))

df = open(file, 'r')
f = df.readlines()
df.close()

fn = np.vectorize(processing, otypes=[float])

fn works on the condition that I don't pass it an array. Consider:

a = ['1,2,3', '4,5,6', '7,8,9']

This:

fn(a, ',', int)

returns,

"ValueError: setting an array element with a sequence."

The rest of my function works. A variant without vectorize also works, but it is really slow for large files. I have a short, one-time script that loaded the file in under 4 minutes, so that's the goal (loadtxt() used up ~16 GB of RAM and crashed my machine). I would like to try this vectorize idea, but if there is a better way to break the data up while minimizing RAM usage, I'm open to that.

styvane
BobJoe1803
    Have you considered opening the file yourself, reading in only a fraction of it, putting that fraction into a StringIO object, and passing the StringIO object to `loadtxt()`? There is an example of this in the `numpy.loadtxt` documentation. – aghast Feb 01 '16 at 06:03

1 Answer


vectorize is not a substitute for iteration. It's a way of giving you the full power of numpy broadcasting. The resulting function takes one or more arrays, broadcasts them together, and then feeds a simple tuple of values (i.e. scalars, one from each array) to the wrapped function.
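As a rough illustration of that point, the sketch below wraps a function that records every pair of arguments it receives. The names `record`, `calls` and `vrecord` are made up for this example. It suggests that the wrapped function is called once per broadcast element and only ever sees scalar values:

```python
import numpy as np

calls = []  # hypothetical log of the arguments the wrapped function receives


def record(x, y):
    # x and y arrive as single values, never as whole arrays
    calls.append((x, y))
    return x + y


# otypes is given explicitly, so vectorize has no need for a trial call
vrecord = np.vectorize(record, otypes=[int])

# A 3-element array broadcast against a scalar: three calls in total
out = vrecord(np.array([1, 2, 3]), 10)

print(out)         # element-wise sums
print(len(calls))  # number of times record() ran
```

The loop still happens, one Python call per element; it is just hidden inside the vectorized wrapper.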

In your code f is a list of lines - all the lines from the file.

You could do something like:

N = len(f)
for i in range(0, N, 1000):
    a = np.loadtxt(f[i:i+1000], delimiter=',')
    # process array a here

In other words, feed the lines of f to loadtxt in blocks.

Actually you don't need to read all of the lines at once. You could write a generator that reads the file line by line, and returns blocks of lines.

The use of generators to feed loadtxt (or genfromtxt) has been discussed before.
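A minimal sketch of such a generator follows. The helper name `iter_blocks`, the block size, and the in-memory `fake_file` standing in for a large file are all assumptions for illustration; with real data you would pass an open file object instead.

```python
import io
from itertools import islice

import numpy as np


def iter_blocks(lines, block_size):
    """Yield successive lists of at most block_size lines from an iterable of lines."""
    it = iter(lines)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


# Stand-in for a large file (assumption); an open file object works the
# same way because it is itself an iterator over lines.
fake_file = io.StringIO("1,2,3\n4,5,6\n7,8,9\n10,11,12\n13,14,15\n")

blocks = []
for block in iter_blocks(fake_file, block_size=2):
    # ndmin=2 keeps a single-line block as a (1, ncols) array
    arr = np.loadtxt(block, delimiter=",", ndmin=2)
    blocks.append(arr)  # replace with the real per-block processing

data = np.vstack(blocks)
print(data.shape)
```

Only one block of raw text lines is held in memory at a time; whether the parsed blocks are kept (as here) or reduced and discarded depends on the processing you need.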


Working example of vectorize

In [121]: def processing(astr):
   .....:     return list(map(int, astr.split(',')))[0] # py3
   .....: 
In [122]: processing(a[0])
Out[122]: 1
In [123]: fn=np.vectorize(processing, otypes=[int])
In [124]: fn(a)
Out[124]: array([1, 4, 7])

This function takes a string and returns one int. It's no better than

In [125]: [processing(l) for l in a]
Out[125]: [1, 4, 7]

I removed delimiter and dtype from the arguments because we don't want to iterate over those parameters. There is an excluded parameter to vectorize for this purpose, but I didn't want to play with that.
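For completeness, here is a hedged sketch of how the `excluded` argument of `np.vectorize` could keep `delimiter` and `dtype` out of the broadcasting. The function name `first_field` is made up for this example, and the excluded arguments are passed by keyword because they are excluded by name:

```python
import numpy as np

a = ['1,2,3', '4,5,6', '7,8,9']


def first_field(line, delimiter, dtype):
    # Still returns one scalar per input string
    return dtype(line.split(delimiter)[0])


# delimiter and dtype are passed through unchanged instead of being broadcast
fn = np.vectorize(first_field, otypes=[int], excluded=['delimiter', 'dtype'])

result = fn(a, delimiter=',', dtype=int)
print(result)
```

This only removes the need to hard-code the two parameters; the function still has to return a single value per call.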

vectorize also takes multiple values for the otypes, but I haven't seen an example of that use. Your function didn't work because it returned a sequence (e.g. 3 ints), but vectorize expected it to return one value (a scalar float or int).
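The failure can be reproduced with a small sketch like the one below, which uses the sample list `a` from the question. Catching `TypeError` as well as `ValueError` is a precaution on my part, since the exact exception may depend on the NumPy version:

```python
import numpy as np

a = ['1,2,3', '4,5,6', '7,8,9']


def processing(astr):
    return list(map(int, astr.split(',')))  # three values per call


# otypes=[float] promises exactly one float per call
fn = np.vectorize(processing, otypes=[float])

try:
    fn(a)
    raised = False
except (ValueError, TypeError) as err:
    raised = True
    print(type(err).__name__, err)
```

Each call hands back a 3-item list where a single float is expected, so the conversion of the collected results to a float array fails.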


If you specify otypes as object, your processing does work - sort of

In [126]: def processing(astr):
   .....:     return list(map(int, astr.split(','))) # py3
   .....: 
In [127]: fn=np.vectorize(processing, otypes=[object])
In [128]: fn(a)
Out[128]: array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=object)

But why not just iterate?

In [129]: [processing(l) for l in a]
Out[129]: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
In [130]: np.array([processing(l) for l in a])
Out[130]: 
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

I vaguely recall some SO questions about errors when using vectorize with object returns.

While I'm on a roll, I might as well illustrate broadcasting:

Here's your function that takes 2 values - a string and a function, and applies the function to each element of the split:

In [131]: def processing(astr, conv):
   .....:     return list(map(conv, astr.split(','))) # py3
   .....: 
In [132]: fn=np.vectorize(processing, otypes=[object])

Now the vectorized function takes 2 inputs, e.g. a list and function:

In [133]: fn(a,int)
Out[133]: array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=object)

or a different function for each string in a (2 lists)

In [134]: fn(a,[int,float,str])
Out[134]: array([[1, 2, 3], [4.0, 5.0, 6.0], ['7', '8', '9']], dtype=object)

Or make the 2nd list a 'column' list - and get back a (2,3) array of lists. One row is ints, the other floats. Obviously I could replace the lists with arrays (0, 1d, 2d etc).

In [136]: fn(a,[[int],[float]])
Out[136]: 
array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
       [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]], dtype=object)

If you need this kind of flexibility in inputs then use vectorize. But if you are just iterating over one array or list - do it directly.


I found an example of multiple otypes: https://stackoverflow.com/a/30255971/901925

Applied to this case:

In [140]: def processing(astr):
   .....:     return tuple(map(int, astr.split(','))) # py3
   .....:

It's important that it returns a tuple, not a list or array.

In [141]: processing(a[0])
Out[141]: (1, 2, 3)
In [142]: fn=np.vectorize(processing, otypes=[int,int,int])

Note that there has to be an otype for each item of the returned tuple.

In [144]: fn(a)
Out[144]: (array([1, 4, 7]), array([2, 5, 8]), array([3, 6, 9]))

But [1, 4, 7] is the first value of each of the 3 inputs. It's returning a tuple of arrays, not one array.

In [146]: x,y,z=fn(a)
In [147]: x
Out[147]: array([1, 4, 7])

This behavior bothered the other questioner, and I doubt if it's what you want either. :)
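If a single 2d array is wanted after all, one possible workaround (my own suggestion, not something from the linked answer) is to stack the returned tuple of column arrays. The names `columns` and `table` are made up for this sketch:

```python
import numpy as np

a = ['1,2,3', '4,5,6', '7,8,9']


def processing(astr):
    return tuple(map(int, astr.split(',')))


fn = np.vectorize(processing, otypes=[int, int, int])

columns = fn(a)                    # tuple of three 1d arrays, one per field
table = np.stack(columns, axis=1)  # shape (3, 3), one row per input string
print(table)
```

That is extra work compared with building the array from a plain list comprehension, which is one more reason to iterate directly here.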

https://stackoverflow.com/a/30088791/901925 - a vectorizing example with time tests.

hpaulj
  • I'm still working my way through this, but you asked why not just iterate. I was hoping that vectorize would let me call the same function once for the many instances, reducing overhead, and that it might allow the processor to deal with several processes at the same time, reducing run time. I'm getting the impression that it just runs them one at a time. – BobJoe1803 Feb 01 '16 at 23:59
  • `np.vectorize` is not a multiprocessing wrapper. It's an iteration wrapper. In other contexts it has saved about 20% in time over user-implemented iteration. But if the task is to process 50,000 lines at a time, the iteration overhead will be small compared to the processing time. – hpaulj Feb 02 '16 at 00:57