
I am reading a list of CSV files and appending each one's data as a new column of my array. My current solution is analogous to the following:

import numpy as np

# Random generator and paths for the sake of reproducibility
fake_read_csv = lambda path: np.random.random(5)
paths = ['a','b','c','d']

first_iteration = True
for path in paths:
    print(f'Reading path {path}')
    sub = fake_read_csv(path)
    if first_iteration:
        first_iteration = False
        pred = sub
    else:
        pred = np.c_[pred, sub]  # append as a new column
print(pred)

I was wondering if it is possible to simplify the loop. For example, something like this:

import numpy as np
fake_read_csv = lambda path: np.random.random(5)
paths = ['a','b','c','d']

pred = np.array([])
for path in paths:
    print(f'Reading path {path}')
    sub = fake_read_csv(path)
    pred = np.c_[pred, sub] # append to a new column

Which raises the error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly
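The mismatch happens because `np.array([])` has shape `(0,)`, which `np.c_` treats as a column with 0 rows, while `sub` has 5 rows; a minimal reproduction:

```python
import numpy as np

empty = np.array([])       # shape (0,)  -> np.c_ sees a (0, 1) column
col = np.random.random(5)  # shape (5,)  -> np.c_ sees a (5, 1) column

# Concatenating along columns requires the row counts (0 vs 5) to match
try:
    np.c_[empty, col]
except ValueError as e:
    print(e)
```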
Fernando Wittmann

1 Answer

For starters, every time you append, an entirely new array is allocated, which is quite wasteful. Instead, you can just combine all your columns once they're loaded:

pred = np.array([fake_read_csv(path) for path in paths], order='F').T

The transpose turns each row you read into a column. order='F' ensures that the memory layout of the transposed result is the same as that of the array in your question (C-contiguous).
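A quick check of the shape and layout, using the same stand-in `fake_read_csv` as in the question:

```python
import numpy as np

fake_read_csv = lambda path: np.random.random(5)
paths = ['a', 'b', 'c', 'd']

# Rows are stacked F-ordered, so the transposed view is C-contiguous
pred = np.array([fake_read_csv(p) for p in paths], order='F').T

print(pred.shape)                   # (5, 4): one column per path
print(pred.flags['C_CONTIGUOUS'])   # True
```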

If you want, you can preallocate the buffer, either by knowing the number of rows up front or by loading the first array. Here's an example of the latter:

first = fake_read_csv(paths[0])
buffer = np.zeros((first.size, len(paths)))
buffer[:, 0] = first
for col, path in enumerate(paths[1:], start=1):
    buffer[:, col] = fake_read_csv(path)

If your concern is calling the reader function multiple times, you can allocate the array in the loop, like this:

buffer = None
for col, path in enumerate(paths):
    data = fake_read_csv(path)
    if buffer is None:
        buffer = np.zeros((data.size, len(paths)))
    buffer[:, col] = data

This option has the additional advantage that it does not require any extra checking to see whether you got any data.
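Putting the third option together with a deterministic stand-in for `fake_read_csv` (hypothetical, for illustration only) shows that the loop produces the same result as stacking the columns directly:

```python
import numpy as np

# Hypothetical deterministic reader: each "file" yields 5 copies of a value
fake_read_csv = lambda path: np.full(5, float(ord(path)))
paths = ['a', 'b', 'c', 'd']

buffer = None
for col, path in enumerate(paths):
    data = fake_read_csv(path)
    if buffer is None:
        # Allocate once we know the number of rows from the first read
        buffer = np.zeros((data.size, len(paths)))
    buffer[:, col] = data

expected = np.column_stack([fake_read_csv(p) for p in paths])
print(np.array_equal(buffer, expected))  # True
```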

Mad Physicist
  • Thanks for the one-liner and the buffer! However, if I can't use list comprehension and I don't know the number of rows up front, then `fake_read_csv` will have to appear twice in the code, right? – Fernando Wittmann Dec 06 '19 at 14:37
  • @FernandoWittmann. You could convert the list comprehension into a for-loop, but the idea is that the first one loads all the columns separately at the same time, then concatenates them (using 2N memory), while the second one preallocates the buffer and only holds one additional column in memory at a time – Mad Physicist Dec 06 '19 at 15:54
  • @FernandoWittmann. I've added a third option that does what the second one does, but only calls the reader inside the loop. – Mad Physicist Dec 06 '19 at 15:57