I'm probably trying to reinvent the wheel here, but numpy has a fromfile() function that can read - I imagine - CSV files.

It appears to be incredibly fast - even compared to Pandas read_csv(), but I'm unclear on how it works.

Here's some test code:

import pandas as pd
import numpy as np

# Create the file here, two columns, one million rows of random numbers.
filename = 'my_file.csv'
df  = pd.DataFrame({'a':np.random.randint(100,10000,1000000), 'b':np.random.randint(100,10000,1000000)})
df.to_csv(filename, index = False)

# Now read the file into memory.
arr = np.fromfile(filename)

print(len(arr))

I included the len() at the end there to make sure it wasn't reading just a single line. But curiously, the length for me (will vary based on your random number generation) was 1,352,244. Huh?

The docs show an optional sep parameter. But when that is used:

arr = np.fromfile(filename, sep = ',')

...we get a length of 0?!

Ideally I'd be able to load a 2D array of arrays from this CSV file, but I'd settle for a single array from this CSV.

What am I missing here?

elPastor

1 Answer

numpy.fromfile is not made to read .csv files; it is made for reading data written with the numpy.ndarray.tofile method.

From the docs:

A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.

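Without a sep parameter, numpy assumes you are reading a binary file and interprets the raw bytes as float64 values (the default dtype), so the length reflects the file size in bytes divided by 8 rather than the number of rows. When you specify a separator, the file is parsed as text, and I guess parsing stops at the first thing that is not a number (here probably the header line a,b), which would explain the length of 0.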
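To make the binary interpretation concrete, here is a minimal sketch. The file name and the tiny hand-written CSV are made up for illustration and are not the file from the question. It compares the length returned by fromfile with the file size in bytes:

```python
import os
import tempfile

import numpy as np

# Hypothetical small CSV (header plus three rows); the values are arbitrary.
csv_text = "a,b\n123,4567\n890,1234\n5678,9012\n"
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="") as f:
    f.write(csv_text)

# With no sep, fromfile reinterprets the raw bytes as float64 values.
arr = np.fromfile(path)

size_in_bytes = os.path.getsize(path)
print(size_in_bytes, len(arr), arr.itemsize)
```

The values in arr are meaningless here, since they are just the text bytes read as floats; only the length relationship is of interest.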

To read a .csv file using numpy, I think you can use numpy.genfromtxt or numpy.loadtxt (from this question).
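A rough sketch of both options, assuming a comma-delimited file with a single header row (the file name and values below are invented for the example):

```python
import os
import tempfile

import numpy as np

# Hypothetical CSV with one header row and two numeric columns.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w") as f:
    f.write("a,b\n1,2\n3,4\n5,6\n")

# loadtxt: skip the header with skiprows, split fields on commas.
arr_loadtxt = np.loadtxt(path, delimiter=",", skiprows=1)

# genfromtxt: same idea, but the header option is called skip_header.
arr_genfromtxt = np.genfromtxt(path, delimiter=",", skip_header=1)

print(arr_loadtxt.shape)
print(arr_genfromtxt.shape)
```

Both calls return a 2D float array with one row per line of the file, which is the "array of arrays" layout asked for in the question.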

araraonline
  • Thanks araraonline, but to the part that says "parsing simply formatted text files" - what are those simply formatted text files then? You'd think they should be able to be created in any text editor and not require `numpy.tofile` – elPastor Mar 06 '19 at 20:41
  • I've experimented here with `tofile` and it seems they are just an array of numbers split by `sep`. For example: `"1,2,3,4,5,6"` if the values are `[1,2,3,4,5,6]`. Also, it only stores 1 dimension (higher dimensional arrays are flattened). – araraonline Mar 06 '19 at 20:44
  • This is sending me down a long rabbit hole. I'd have to believe it can read 2D text files like the ones saved using this answer: https://stackoverflow.com/questions/3685265/how-to-write-a-multidimensional-array-to-a-text-file. – elPastor Mar 06 '19 at 20:49
  • From a quick glance, it seems like they're using `savetxt` and `loadtxt` (instead of `fromfile`)... Maybe that's the source of your confusion (?) – araraonline Mar 06 '19 at 20:56
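The tofile behaviour described in the comments can be sketched roughly as follows. The array, the file name, and the reshape step are illustrative assumptions; the shape has to be supplied by hand because the text file does not store it:

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# A small 2D array; tofile writes it flattened, with items joined by sep.
a = np.arange(1, 7).reshape(2, 3)
a.tofile(path, sep=",")

with open(path) as f:
    text = f.read()
print(text)

# Reading it back gives a 1D array; the original shape must be restored manually.
flat = np.fromfile(path, dtype=a.dtype, sep=",")
restored = flat.reshape(2, 3)
print(flat.shape, restored.shape)
```

This is the kind of "simply formatted text file" fromfile handles: one flat run of numbers with a single separator, no header, and no row structure.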