
I would like to read a CSV file with numpy.loadtxt. I know that I can specify the columns I want to read with the usecols parameter. However, what I actually want to do is to specify a list of columns not to read. This is because I don't actually know how many columns my file will contain.

Is there any way to do this, other than reading the first few lines of the file, determining the total number of columns and then manually calculating the set of columns to read?

Nils

1 Answer


Not without reading the first line, as you mentioned.

However, it might be easier to do:

import numpy as np

do_not_read_cols = [3, 4, 9]
data = np.loadtxt('filename', delimiter=',')
data = np.delete(data, do_not_read_cols, axis=1)

This won't be terribly memory-efficient, but loadtxt doesn't try to be very memory-efficient to begin with. Unless you're deleting the majority of the columns, you'll use more memory with the call to loadtxt than you will with the subsequent temporary copy that delete will make.
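
For reference, here's a minimal sketch of the read-the-first-line approach mentioned in the question, assuming a comma-delimited file with no header row (the filename and skip set are placeholders):

import numpy as np

skip = {3, 4, 9}  # placeholder column indices to drop
with open('filename') as infile:
    ncols = len(infile.readline().split(','))  # count columns from the first line
usecols = [c for c in range(ncols) if c not in skip]
data = np.loadtxt('filename', delimiter=',', usecols=usecols)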


To expand on my comment below, if you want to be memory-efficient and don't want to use pandas, another option is something like this: (Note: written a bit sloppily.)

import numpy as np

def generate_text_file(length=1000000, ncols=20):
    data = np.random.random((length, ncols))
    np.savetxt('large_text_file.csv', data, delimiter=',')

def iter_loadtxt(filename, delimiter=',', skiprows=0, skipcols=None, dtype=float):
    if skipcols is None:
        skipcols = []
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for i, item in enumerate(line):
                    if i in skipcols:
                        continue
                    yield dtype(item)
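        # Record how many columns are kept per row; np.fromiter exhausts the
        # generator first, so this attribute is set before the reshape below reads it.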
        iter_loadtxt.rowlength = len(line) - len(skipcols)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

#generate_text_file()
data = iter_loadtxt('large_text_file.csv')
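
To skip columns with this reader, pass their indices via skipcols (the indices here are just an example):

data = iter_loadtxt('large_text_file.csv', skipcols=[3, 4, 9])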
Joe Kington
  • Thanks, but unfortunately memory efficiency is the whole reason for doing this in the first place, so this doesn't work for me. – Nils Jan 09 '14 at 14:42
  • @Nils - If you're worried about memory efficiency, don't use `loadtxt`. It will use ~8x the memory necessary to load the array. (Not to plug my own answer, but see this as an example: http://stackoverflow.com/questions/8956832/python-out-of-memory-on-large-csv-file-numpy/8964779#8964779) `pandas` is actually quite memory-efficient, if you want to go that route. Because `pandas` effectively stores _each column_ in its own array, dropping a set of columns won't require making a copy. Alternately, you could write your own loading generator in a few lines and read it in with `np.fromiter`. – Joe Kington Jan 09 '14 at 14:46
  • I am working on fixing some existing code, so I would like to avoid rewriting the whole method to use a different package altogether. But if it had been my decision I would have probably used pandas in the first place, yes. :) – Nils Jan 09 '14 at 14:55
  • memory might not be the only reason, as the datafile could contain one column of text (to ignore) and rest of floats (to be read) - however the last solution will work for this case – gluuke Feb 15 '19 at 16:06
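
As a footnote on the pandas route mentioned in the comments: with a reasonably recent pandas, read_csv accepts a callable for usecols, so you can exclude columns by position without knowing how many columns the file has. A minimal sketch, assuming a comma-delimited file with no header row (the skip set is a placeholder):

import pandas as pd

skip = {3, 4, 9}  # placeholder column indices to discard
# With header=None the column labels are the integer positions 0..n-1,
# so the callable filters by index and the skipped columns are not stored.
df = pd.read_csv('large_text_file.csv', header=None,
                 usecols=lambda c: c not in skip)
data = df.to_numpy()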