3

I am having trouble reading a tab-delimited csv file in Python. I use the following function:

from numpy import genfromtxt, array

def csv2array(filename, skiprows=0, delimiter='\t', raw_header=False, missing=None, with_header=True):
    """
    Parse a file into an array. Return the array and any additional header lines.
    By default, parse the header lines into dictionaries, assuming the parameters
    are numeric, using 'parse_header'.
    """
    f = open(filename, 'r')
    skipped_rows = []
    for n in range(skiprows):
        header_line = f.readline().strip()
        if raw_header:
            skipped_rows.append(header_line)
        else:
            skipped_rows.append(parse_header(header_line))
    f.close()
    if missing:
        data = genfromtxt(filename, dtype=None, names=with_header,
                          deletechars='', skiprows=skiprows, missing=missing)
    else:
        if delimiter != '\t':
            data = genfromtxt(filename, dtype=None, names=with_header, delimiter=delimiter,
                              deletechars='', skiprows=skiprows)
        else:
            data = genfromtxt(filename, dtype=None, names=with_header,
                              deletechars='', skiprows=skiprows)
    if data.ndim == 0:
        data = array([data.item()])
    return (data, skipped_rows)

the problem is that genfromtxt complains about my files, e.g. with the error:

Line #27100 (got 12 columns instead of 16)

I am not sure where these errors come from. Any ideas?

Here's an example file that causes the problem:

#Gene   120-1   120-3   120-4   30-1    30-3    30-4    C-1 C-2 C-5 genesymbol  genedesc
ENSMUSG00000000001  7.32    9.5 7.76    7.24    11.35   8.83    6.67    11.35   7.12    Gnai3   guanine nucleotide binding protein alpha
ENSMUSG00000000003  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Pbsn    probasin

Is there a better way to write a generic csv2array function? thanks.
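
For reference, a minimal sketch of what seems to be going on (hypothetical two-row data, not the real file): when no `delimiter` is passed, `genfromtxt` splits on any run of whitespace, so the spaces inside a free-text field like `genedesc` become extra columns, and rows disagree on the column count.

```python
import io
import numpy as np

# Hypothetical two-row reproduction: the last field contains a space.
text = "G1\t1.5\tshort name\nG2\t2.5\tx\n"

# With no delimiter, genfromtxt splits on any whitespace, so 'short name'
# becomes two columns and the row lengths disagree:
try:
    np.genfromtxt(io.StringIO(text), dtype=None, encoding='utf-8')
    raised = False
except ValueError:  # reports something like "Line #2 (got 3 columns instead of 4)"
    raised = True

# Splitting on tabs only keeps the free-text field intact:
rows = np.genfromtxt(io.StringIO(text), delimiter='\t', dtype=None,
                     encoding='utf-8')
```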

  • It appears that when it gets to the third line in the file it thinks that there are 16 columns (based on line 2 for some reason) and then rejects the file. Any idea why the last field of line 2 would be interpreted that way? It has no tabs, only spaces, but it seems to interpret each word in the last field of line 2 as a column field. –  May 18 '10 at 17:20
  • Your parser must be interpreting spaces as delimiters. I'm not sure what the genfromtxt does, but if it's building an array, it might silently expand itself if you feed it a row bigger than any other previously, but then get angry when it gets a smaller one. In any case, using the `csv` module is much more robust if you're dealing with potentially unknown data. – Nick T May 18 '10 at 19:56
  • but how can I robustly go from csv to an array? –  May 18 '10 at 23:17
  • Did you try to specify '\t' as the delimiter to genfromtxt? – Stefan van der Walt May 21 '10 at 03:48

5 Answers

6

Check out the python CSV module: http://docs.python.org/library/csv.html

import csv
reader = csv.reader(open("myfile.csv", "rb"),
                    delimiter='\t', quoting=csv.QUOTE_NONE)

header = []
records = []
fields = 16

if thereIsAHeader: header = reader.next()

for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" % \
            (row, len(record), fields)
    else:
        records.append(record)

# do numpy stuff.
Nick T
    this does not make a numpy array out of the result, unfortunately –  May 18 '10 at 17:16
  • You can do whatever you like with the data in the loop body; there it's a list broken up by delimiter. You could check if it's as long as you expect, (in edited example), or do validation on each field to make sure you're not passing garbage into your numpy array. – Nick T May 18 '10 at 20:00
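
Following up the comment thread: once the loop has filtered out the malformed rows, converting the surviving records into a numpy array is a single call. A sketch with made-up data standing in for the filtered `records` list:

```python
import numpy as np

# Hypothetical validated records, as the filtering loop above would produce;
# csv gives back strings, so convert while building the array.
records = [['1.0', '2.0', '3.0'],
           ['4.0', '5.0', '6.0']]

arr = np.array(records, dtype=float)
```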
2

May I ask why you're not using the built-in csv reader? http://docs.python.org/library/csv.html

I've used it very effectively with numpy/scipy. I would share my code, but unfortunately it's owned by my employer; it should be very straightforward to write your own, though.
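
A minimal sketch of such a reader (the column layout here is assumed for illustration, not taken from the question): read the header with `csv`, then hand the numeric columns to numpy.

```python
import csv
import io
import numpy as np

# Assumed layout: a header row, then a gene id followed by numeric columns.
text = "gene\tv1\tv2\nG1\t1.5\t2.5\nG2\t3.0\t4.0\n"

reader = csv.reader(io.StringIO(text), delimiter='\t')
header = next(reader)                                   # first row is the header
values = np.array([row[1:] for row in reader], dtype=float)  # drop the id column
```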

Uri
0

I think Nick T's approach is the better way to go, but I would make one change: replace the following code:

for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" % \
            (row, len(record), fields)
    else:
        records.append(record)

with

records = np.asarray([row for row in reader if len(row) == fields])
# note: you cannot call len() on an iterator the way you can on a list or
# tuple, so counting the skipped records takes a little more work than
# len(reader) - len(records)

Wrapping the list comprehension in np.asarray returns a numpy array and takes advantage of numpy's pre-compiled routines, which should speed things up greatly. Also, I would recommend using print() as a function rather than the print statement, since the former is the standard for Python 3, which is most likely the future; and I would use logging over print.
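
Putting the pieces together, a runnable sketch that also counts the skipped rows, by materializing the reader into a list first (a small in-memory example stands in for a real file):

```python
import csv
import io
import numpy as np

fields = 3
data = "1,2,3\n4,5\n6,7,8\n"  # the second row is malformed

# Materialize the iterator so we can count rows before and after filtering.
rows = list(csv.reader(io.StringIO(data)))
good = [row for row in rows if len(row) == fields]
records = np.asarray(good, dtype=float)
skipped = len(rows) - len(good)
```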

William komp
0

Likely it came from Line 27100 in your data file... and it had 12 columns instead of 16. I.e. it had:

separator,1,2,3,4,5,6,7,8,9,10,11,12,separator

And it was expecting something like this:

separator,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,separator

I'm not sure how you want to convert your data, but if you have irregular line lengths, the easiest way would be something like this:

lines = f.read().split('someseparator')
for line in lines:
    splitline = line.split(',')
    #do something with splitline
user
Wayne Werner
  • I added an example file that leads to the error -- it looks to me like it has the right number of columns but for some reason it thinks it has 16 columns. Any idea what causes this? –  May 18 '10 at 17:16
0

I have successfully used two approaches: (1) if I simply need to read arbitrary CSV, I use the csv module (as other users pointed out), and (2) if I need repeated processing of a known CSV (or other) format, I write a simple parser.

It seems that your problem fits in the second category, and a parser should be very simple:

f = open('file.txt', 'r').readlines()
for line in f:
    tokens = line.strip().split('\t')
    gene = tokens[0]
    vals = [float(k) for k in tokens[1:10]]
    stuff = tokens[10:]
    # do something with gene, vals, and stuff

You can add a line in the reader to skip comments (`if line.startswith('#'): continue`) or to handle blank lines (`if not line.strip(): continue`). You get the idea.
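
One way to fold in the comment and blank-line guards mentioned above, checking the raw line before tokenizing (sample lines are made up for illustration):

```python
# Made-up input lines standing in for file contents.
lines = ["#Gene\t120-1\tgenedesc",   # comment/header line
         "",                         # blank line
         "G1\t1.0\t2.0\tsome desc"]

parsed = []
for line in lines:
    if not line.strip() or line.startswith('#'):
        continue  # skip blank lines and comments before splitting
    tokens = line.strip().split('\t')
    gene = tokens[0]
    vals = [float(k) for k in tokens[1:3]]
    stuff = tokens[3:]
    parsed.append((gene, vals, stuff))
```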

Escualo