I am trying to import a text file using `numpy.loadtxt`. The data file contains more than ten thousand rows, most of which have 33 columns, but there are a few rows that have only one column instead of 33. I have tried `numpy.loadtxt` and `genfromtxt`, but both give error messages. How can I import such a data file in Python?
- Try `genfromtxt`: http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt – tom10 Aug 17 '15 at 03:03
- Hi tom10, using `genfromtxt` I can skip only one row with an uneven column count, but there are 100 such rows in total. How can I skip them all? – PyLabour Aug 17 '15 at 03:09
- How do you want to store them? You could just read the file line by line. – kampta Aug 17 '15 at 03:12
- Hi kampta, I want to skip the rows that have only one column. – PyLabour Aug 17 '15 at 03:18
3 Answers
If you want to ignore the lines with one column, you can use `genfromtxt` with the argument `invalid_raise=False`. For this to work, the first line must have the full number of columns. For example, here's the file `foo.txt`:
10 20 30
40 50 60
99
70 80 90
10 20 30
99
40 50 60
Read the file using `genfromtxt` with `invalid_raise=False`. A warning is generated, but the array of data for the lines with three columns is returned:
In [2]: genfromtxt('foo.txt', invalid_raise=False)
/Users/warren/anaconda/lib/python2.7/site-packages/numpy/lib/npyio.py:1695: ConversionWarning: Some errors were detected !
Line #3 (got 1 columns instead of 3)
Line #6 (got 1 columns instead of 3)
warnings.warn(errmsg, ConversionWarning)
Out[2]:
array([[ 10., 20., 30.],
[ 40., 50., 60.],
[ 70., 80., 90.],
[ 10., 20., 30.],
[ 40., 50., 60.]])

- Hi Warren, this is exactly what I was looking for. Thank you very much. One additional question, though: I have 1,095,900 rows in the data file. When I tried to read all of them at once, it produced a 'MemoryError'. Then I divided the data into two files and it works. Although I can work with the divided files, I'm wondering: is there any way to read the whole file at once? – PyLabour Aug 17 '15 at 03:55
- Sorry, I haven't looked into how much memory overhead `genfromtxt` has, but it wouldn't surprise me if it is a lot. – Warren Weckesser Aug 17 '15 at 03:59
- `genfromtxt` creates a list of lists (one sublist per line), and produces the array with one final call. It's written in Python, so you can check the details yourself. – hpaulj Aug 17 '15 at 04:59
`genfromtxt` accepts any iterable or generator that gives it one line at a time. So instead of giving it a file (or filename), I'd write a little generator function that reads the file and skips the lines with the wrong number of columns.

This way of using `genfromtxt` has been discussed in previous SO questions; the most recent asked how to read selected rows from a file.
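A minimal sketch of that generator idea (the 33-column count comes from the question; the file name `data.txt` is a placeholder):

import numpy as np

def filtered_lines(filename, ncols=33):
    # Yield only the lines that split into the expected number of columns.
    with open(filename) as f:
        for line in f:
            if len(line.split()) == ncols:
                yield line

data = np.genfromtxt(filtered_lines('data.txt'))

Since the bad lines never reach `genfromtxt`, no warning is raised and the first line of the file no longer needs to have the full number of columns.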
- Hi hpaulj, thanks for the answer. However, I found Warren's answer more convenient. – PyLabour Aug 17 '15 at 03:59
For the large-file aspect of this problem, you might consider using `pandas.read_table`, which lets you read files in chunks and has similar file-reading utilities. Here's the basic idea, using Warren's example file:
import pandas as pd

# Read foo.txt in chunks of three rows; short lines are padded with NaN.
data_reader = pd.read_table("foo.txt", header=None, sep=r' ', dtype=float, chunksize=3)
for chunk in data_reader:
    data = chunk.dropna()  # drop the rows that had only one column
    print(data.values)
This yields three numpy arrays:
[[ 10. 20. 30.]
[ 40. 50. 60.]]
[[ 70. 80. 90.]
[ 10. 20. 30.]]
[[ 40. 50. 60.]]
The keywords you need to pass `read_table` are a little different from the ones for `loadtxt`. For instance, here I used `sep=r' '` to fit the format of Warren's file, and I set `dtype=float` so that NaNs would be supported; that lets me use the `dropna()` method to drop the short lines. Finally, the `.values` attribute returns a `numpy.ndarray`.
There is lots of other help on SO for tweaking `read_table`, so I won't go into detail here. Hope this helps.
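If you do want a single array at the end (and the filtered data fits in memory), one option is to stack the chunks as you go. A minimal sketch, assuming the same file format; the chunk size here is just an illustrative value:

import numpy as np
import pandas as pd

reader = pd.read_table("foo.txt", header=None, sep=r' ', dtype=float, chunksize=100000)
# Drop the incomplete rows from each chunk, then stack everything into one array.
data = np.vstack([chunk.dropna().values for chunk in reader])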
