
How to convert a .csv file to .npy efficiently?

I've tried:

import numpy as np

filename = "myfile.csv"
vec = np.loadtxt(filename, delimiter=",")
np.save(f"{filename}.npy", vec)

While the above works for smallish files, the actual .csv file I'm working on has ~12 million lines with 1024 columns, and it takes quite a while to load everything into RAM before converting it into the .npy format.

Q (Part 1): Is there some way to load/convert a .csv to .npy efficiently for a large CSV file?

The above code snippet is similar to the answer from Convert CSV to numpy, but that won't work for a ~12M x 1024 matrix.

Q (Part 2): If there isn't any way to load/convert a .csv to .npy efficiently, is there some way to iteratively read the .csv file into .npy efficiently?

Also, there's an answer here https://stackoverflow.com/a/53558856/610569 to save the csv file as a numpy array iteratively. But it seems like np.vstack isn't the best solution when reading the file. The accepted answer there suggests hdf5, but that format is not the main objective of this question, and hdf5 isn't desired in my use case since I have to read it back into a numpy array afterwards.

Q (Part 3): If part 1 and part 2 are not possible, are there other efficient storage formats (e.g. tensorstore) that can store the data and efficiently convert it to a numpy array when the saved format is loaded?

There is another library, tensorstore, that seems to handle arrays efficiently and supports conversion to a numpy array when read, https://google.github.io/tensorstore/python/tutorial.html. But somehow there isn't any information on how to save the tensor/array without knowing the exact dimensions; all of the examples seem to include configurations like 'dimensions': [1000, 20000].

Unlike HDF5, tensorstore doesn't seem to have read-overhead issues when converting to numpy; from the docs:

Conversion to an numpy.ndarray also implicitly performs a synchronous read (which hits the in-memory cache since the same region was just retrieved)

alvas
  • Does this answer your question? [Efficient way to process CSV file into a numpy array](https://stackoverflow.com/questions/34909077/efficient-way-to-process-csv-file-into-a-numpy-array) – Josh Friedlander Oct 13 '22 at 12:11
  • Not very helpful but you can write the code to save to the NumPy format yourself, and just skip any interaction with the numpy code at all. The hardest part would be creating the header bytes https://numpy.org/devdocs/reference/generated/numpy.lib.format.html – Tom McLean Oct 13 '22 at 13:04
  • Which is the big time user, the loadtxt or the save? `np.save` is a straightforward write of the array data, so it should be relatively fast. `loadtxt` is, or was, Python text handling, though the recent version is supposed to be faster – hpaulj Oct 13 '22 at 13:21
  • Depending on the dtype of the tensor, you may be dealing with ~90GB of data. You can use many tools (including pandas, or a simple read-and-convert via generators) to read the csv in chunks and store them. Why do you want to save it all in one file? You will have similar (memory) problems when reading it back into memory as well. It is, however, possible to append to the .npy format (along dim 0), but it seems to me that if these are embeddings, they should be treated as data and better kept in chunks with an index for easy access. – amirhm Oct 26 '22 at 11:06
  • By the way, in any case, even if you save in a very naive binary format, converting to numpy is not difficult: you could use np.ndarray and, given the dimensions and dtype, point it at the buffer which holds the data, and that is your conversion. – amirhm Oct 26 '22 at 11:10
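
A minimal sketch of that last idea, assuming the values were dumped as contiguous float32 into a raw binary file (the filename, dtype and shape below are placeholders); np.memmap does essentially the buffer-pointing described in that comment, backed by a file on disk:

import numpy as np

# Map the raw binary file directly into an array without reading it all in;
# dtype and shape must be supplied, since a raw dump stores no metadata.
arr = np.memmap("data.bin", dtype=np.float32, mode="r", shape=(12_000_000, 1024))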

5 Answers


Nice question; informative in itself.

I understand you want to have the whole data set/array in memory, eventually, as a NumPy array. I assume, then, you have enough (RAM) memory to host such an array -- 12M x 1K.

I don't know specifically how np.loadtxt (genfromtxt) operates behind the scenes, so I will tell you how I would do it (after trying what you did).

Reasoning about memory...

Notice that a simple boolean array will cost ~12 GBytes of memory:

>>> print("{:.1E} bytes".format(
        np.array([True]).itemsize * 12E6 * 1024
    ))
1.2E+10 bytes

And this is for a Boolean data type. Most likely, you have -- what -- a dataset of Integer, Float? The size may increase quite significantly:

>>> np.array([1], dtype=bool).itemsize
1
>>> np.array([1], dtype=int).itemsize
8
>>> np.array([1], dtype=float).itemsize
8

It's a lot of memory (which you know, just want to emphasize).

At this point, I would like to point out the possibility of swapping working memory. You may have enough physical (RAM) memory in your machine, but if there is not enough free memory, your system will use swap memory (i.e., disk) to keep the system stable and get the work done. The cost you pay is clear: reading/writing from/to the disk is very slow.

My point so far is: check the data type of your dataset, estimate the size of your future array, and guarantee you have that minimum amount of RAM memory available.
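
For instance, a quick back-of-the-envelope estimate for the ~12M x 1024 case (a small sketch; float32/float64 are just the usual candidates):

import numpy as np

rows, cols = 12_000_000, 1024
for dtype in (np.float32, np.float64):
    # bytes per element times number of elements, reported in GB
    size_gb = rows * cols * np.dtype(dtype).itemsize / 1e9
    print(f"{np.dtype(dtype).name}: ~{size_gb:.0f} GB")
# float32: ~49 GB, float64: ~98 GB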

I/O text

Considering you do have all the (RAM) memory necessary to host the whole numpy array: I would then loop over the whole (~12M lines) text file, filling the pre-existing array row-by-row.

More precisely, I would have the (big) array already instantiated before starting to read the file. Only then would I read each line, split the columns, give them to np.asarray, and assign those (1024) values to the respective row of the output array.

The looping over the file is slow, yes. The thing here is that you limit (and control) the amount of memory being used. Roughly speaking, the big objects consuming your memory are the "output" (big) array and the "line" (1024) array. Sure, there is a considerable amount of memory consumed on each iteration by temporary objects while reading the (text!) values, splitting them into list elements and casting them to an array. Still, that overhead remains largely constant over the whole ~12M lines.

So, the steps I would go through are:

0) estimate and guarantee enough RAM memory available
1) instantiate (np.empty or np.zeros) the "output" array
2) loop over "input.txt" file, create a 1D array from each line "i"
3) assign the line values/array to row "i" of "output" array
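
A minimal sketch of those steps, assuming the number of rows is known up front and float32 is enough precision (the filename and shape are placeholders):

import numpy as np

n_rows, n_cols = 12_000_000, 1024                    # step 0/1: known shape, enough RAM
out = np.empty((n_rows, n_cols), dtype=np.float32)   # step 1: pre-allocate the output

with open("myfile.csv") as f:                        # step 2: loop over the text file
    for i, line in enumerate(f):
        out[i] = np.fromstring(line, sep=",")        # step 3: parse and assign row i

np.save("myfile.npy", out)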

Sure enough, you can even make it parallel: if, on one hand, text files cannot be randomly (r/w) accessed, on the other hand you can easily split them (see How can I split one text file into multiple *.txt files?) and -- if fun is at the table -- have them read in parallel, if time is critical.

Hope that helps.

Brandt

TL;DR

Exporting to a format other than .npy seems inevitable unless your machine is able to handle the size of the data in memory, as described in @Brandt's answer.


Reading the data, then processing it (Kinda answering Q part 2)

To handle data larger than what the RAM can hold, one would often resort to libraries that perform "out-of-core" computation, e.g. turicreate.SFrame, vaex or dask. These libraries are able to lazily load the .csv files into dataframes and process them in chunks when evaluated.

from turicreate import SFrame

filename = "myfile.csv"
sf = SFrame.read_csv(filename)
sf.apply(...) # Trying to process the data

or

import vaex

filename = "myfile.csv"
df = vaex.from_csv(filename, 
    convert=True, 
    chunk_size=50_000_000)

df.apply(...)

Converting the read data into numpy array (kinda answering Q part 1)

While out-of-core libraries can read and process the data efficiently, converting into numpy is an "in-memory" operation: the machine needs to have enough RAM to fit all the data.

The turicreate.SFrame.to_numpy documentation writes:

Converts this SFrame to a numpy array

This operation will construct a numpy array in memory. Care must be taken when size of the returned object is big.

And the vaex documentation writes:

In-memory data representations

One can construct a Vaex DataFrame from a variety of in-memory data representations.

And dask actually reimplements its own array objects that are simpler than numpy arrays (see https://docs.dask.org/en/stable/array-best-practices.html). But going through the docs, it seems like the formats dask arrays are saved to are not .npy but various other formats.
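
For example, sticking with the turicreate snippet above, the final conversion is a one-liner once (and only if) the whole 12M x 1024 array fits in RAM (a sketch; the filename is a placeholder):

import numpy as np
from turicreate import SFrame

sf = SFrame.read_csv("myfile.csv")   # out-of-core read, as above
arr = sf.to_numpy()                  # materializes the whole array in RAM
np.save("myfile.npy", arr)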

Writing the file into non-.npy versions (answering Q Part 3)

Given that numpy arrays inevitably live in memory, trying to save the data into one single .npy isn't the most viable option.

Different libraries seem to have different solutions for storage. E.g.

  • vaex saves the data into hdf5 by default if the convert=True argument is set when data is read through vaex.from_csv()
  • sframe saves the data into their own binary format
  • dask saves data via its to_hdf() and to_parquet() export functions
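
A sketch of the dask route, assuming a parquet backend (pyarrow or fastparquet) is installed; the paths are placeholders:

import dask.dataframe as dd

# Stream the CSV in partitions and write it out as parquet,
# without holding the full table in memory at once.
ddf = dd.read_csv("myfile.csv")
ddf.to_parquet("myfile_parquet/")

# Later: read it back and materialize a numpy array (an in-memory step again).
arr = dd.read_parquet("myfile_parquet/").to_dask_array(lengths=True).compute()
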
alvas

In its latest version (4.14), vaex supports "streaming", i.e. lazy loading of CSV files. It uses pyarrow under the hood, so it is super fast. Try something like

df = vaex.open("my_file.csv")
# or
df = vaex.from_csv_arrow("my_file.csv", lazy=True)

Then you can export to a bunch of formats as needed, or keep working with it like that (it is surprisingly fast). Of course, it is better to convert to some kind of binary format.
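
A sketch of that export step, assuming vaex's export helpers that pick the writer from the file extension (the filenames are placeholders):

import vaex

df = vaex.open("my_file.csv")            # lazy, arrow-backed read as above
df.export("my_file.hdf5")                # or df.export_parquet("my_file.parquet")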

Joco
import numpy as np
import pandas as pd

# Define the input and output file names
csv_file = 'data.csv'
npy_file = 'data.npy'

# Create dummy data
data = np.random.rand(10000, 100)
df = pd.DataFrame(data)
df.to_csv(csv_file, index=False)

# Define the chunk size
chunk_size = 1000

# Read the header row and get the number of columns
header = pd.read_csv(csv_file, nrows=0)
num_cols = len(header.columns)

# Collect the chunk arrays and concatenate them once at the end
chunks = []

# Loop over the chunks of the csv file
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
    # Convert the chunk to a numpy array and collect it
    chunks.append(chunk.to_numpy())

# Stack all chunks into a single (num_rows, num_cols) array
data = np.concatenate(chunks, axis=0) if chunks else np.empty((0, num_cols))

np.save(npy_file, data)

# Load the npy file and check the shape
npy_data = np.load(npy_file)
print('Shape of data before conversion:', data.shape)
print('Shape of data after conversion:', npy_data.shape)
Ahmed Mohamed

I'm not aware of any existing function or utility that directly and efficiently converts csv files into npy files. By efficient I guess you primarily mean with low memory requirements.

Writing a npy file iteratively is indeed possible, with some extra effort. There's already a question on SO that addresses this, see: save numpy array in append mode

For example, using the NpyAppendArray class from Michael's answer, you can do:

import numpy as np
from npy_append_array import NpyAppendArray  # the class from Michael's answer (npy-append-array package)

with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    for line in csv:
        row = np.fromstring(line, sep=',')
        npy.append(row[np.newaxis, :])

The NpyAppendArray class updates the npy file header on every call to append, which is a bit much for your 12M rows. Maybe you could update the class to (optionally) only write the header on close. Or you could easily batch the writes:

batch_lines = 128
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    done = False
    while not done:
        batch = []
        for count, line in enumerate(csv):
            row = np.fromstring(line, sep=',')
            batch.append(row)
            if count + 1 >= batch_lines:
                break
        else:
            done = True
        if batch:  # guard against an empty final batch when the file ends on a batch boundary
            npy.append(np.array(batch))

(code is not tested)

user7138814