
I am reading a list of CSV files and appending each one's data as a new column of my array. My current solution is analogous to the following:

import numpy as np

# Random generator and paths for the sake of reproducibility
fake_read_csv = lambda path: np.random.random(5)
paths = ['a','b','c','d']

first_iteration = True
for path in paths:
    print(f'Reading path {path}')
    sub = fake_read_csv(path)
    if first_iteration:
        first_iteration = False
        pred = sub
    else:
        pred = np.c_[pred, sub]  # append as a new column
print(pred)

I was wondering if it is possible to simplify the loop. For example, something like this:

import numpy as np
fake_read_csv = lambda path: np.random.random(5)
paths = ['a','b','c','d']

pred = np.array([])
for path in paths:
    print(f'Reading path {path}')
    sub = fake_read_csv(path)
    pred = np.c_[pred, sub] # append to a new column

Which raises the error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly
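The mismatch happens because `np.array([])` has shape `(0,)`, which `np.c_` treats as a column with 0 rows, while `sub` has 5 rows; a minimal reproduction:

```python
import numpy as np

empty = np.array([])       # shape (0,)  -> np.c_ sees a (0, 1) column
col = np.random.random(5)  # shape (5,)  -> np.c_ sees a (5, 1) column

# Concatenating along columns requires the row counts (0 vs 5) to match
try:
    np.c_[empty, col]
except ValueError as e:
    print(e)
```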
Fernando Wittmann

1 Answer

For starters, every time you append, an entirely new array is allocated, which is quite wasteful. Instead, you can just combine all your columns once they're loaded:

pred = np.array([fake_read_csv(path) for path in paths], order='F').T

The transpose turns each row you read into a column. order='F' ensures that the memory layout of the transposed result is the same as that of the array in your question (C-contiguous).
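A quick check of the shape and layout, using the same stand-in `fake_read_csv` as in the question:

```python
import numpy as np

fake_read_csv = lambda path: np.random.random(5)
paths = ['a', 'b', 'c', 'd']

# Rows are stacked F-ordered, so the transposed view is C-contiguous
pred = np.array([fake_read_csv(p) for p in paths], order='F').T

print(pred.shape)                   # (5, 4): one column per path
print(pred.flags['C_CONTIGUOUS'])   # True
```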

If you want, you can preallocate the buffer, either by knowing the number of rows up front or by loading the first array. Here's an example of the latter:

first = fake_read_csv(paths[0])
buffer = np.zeros((first.size, len(paths)))
buffer[:, 0] = first
for col, path in enumerate(paths[1:], start=1):
    buffer[:, col] = fake_read_csv(path)

If your concern is calling the reader function multiple times, you can allocate the array in the loop, like this:

buffer = None
for col, path in enumerate(paths):
    data = fake_read_csv(path)
    if buffer is None:
        buffer = np.zeros((data.size, len(paths)))
    buffer[:, col] = data

This option has the additional advantage that it does not require any extra checking to see whether you got any data.
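Putting the third option together with a deterministic stand-in for `fake_read_csv` (hypothetical, for illustration only) shows that the loop produces the same result as stacking the columns directly:

```python
import numpy as np

# Hypothetical deterministic reader: each "file" yields 5 copies of a value
fake_read_csv = lambda path: np.full(5, float(ord(path)))
paths = ['a', 'b', 'c', 'd']

buffer = None
for col, path in enumerate(paths):
    data = fake_read_csv(path)
    if buffer is None:
        # Allocate once we know the number of rows from the first read
        buffer = np.zeros((data.size, len(paths)))
    buffer[:, col] = data

expected = np.column_stack([fake_read_csv(p) for p in paths])
print(np.array_equal(buffer, expected))  # True
```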

Mad Physicist
  • Thanks for the one-liner and the buffer! However, if I can't use list comprehension and I don't know the number of rows up front, then `fake_read_csv` will have to appear twice in the code, right? – Fernando Wittmann Dec 06 '19 at 14:37
  • @FernandoWittmann. You could convert the list comprehension into a for-loop, but the idea is that the first one loads all the columns separately at the same time, then concatenates them (using 2N memory), while the second one preallocates the buffer and only holds one additional column in memory at a time – Mad Physicist Dec 06 '19 at 15:54
  • @FernandoWittmann. I've added a third option that does what the second one does, but only calls the reader inside the loop. – Mad Physicist Dec 06 '19 at 15:57