I am trying to import a text file using `numpy.loadtxt`. The data file contains more than ten thousand rows, most of which have 33 columns, but there are a few rows that have only one column instead of 33. I have tried `numpy.loadtxt` and `genfromtxt`, but both give error messages. How can I import such a data file in Python?
- Try `genfromtxt`: http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt – tom10 Aug 17 '15 at 03:03
- Hi tom10, using `genfromtxt` I can skip only one row with an uneven column count, but there are 100 such rows in total. How can I skip them all? – PyLabour Aug 17 '15 at 03:09
- How do you want to store them? You could just read the file line by line. – kampta Aug 17 '15 at 03:12
- Hi kampta, I want to skip the rows that have only one column. – PyLabour Aug 17 '15 at 03:18
3 Answers
If you want to ignore the lines with one column, you can use `genfromtxt` with the argument `invalid_raise=False`. For this to work, the first line must have the full number of columns. For example, here's the file `foo.txt`:
10 20 30
40 50 60
99
70 80 90
10 20 30
99
40 50 60
Read the file using `genfromtxt` with `invalid_raise=False`. A warning is generated, but the array of data for the lines with three columns is returned:
In [2]: genfromtxt('foo.txt', invalid_raise=False)
/Users/warren/anaconda/lib/python2.7/site-packages/numpy/lib/npyio.py:1695: ConversionWarning: Some errors were detected !
Line #3 (got 1 columns instead of 3)
Line #6 (got 1 columns instead of 3)
warnings.warn(errmsg, ConversionWarning)
Out[2]:
array([[ 10., 20., 30.],
[ 40., 50., 60.],
[ 70., 80., 90.],
[ 10., 20., 30.],
[ 40., 50., 60.]])

- Hi Warren, this is exactly what I was looking for. Thank you very much. One additional question, though: I have 1,095,900 rows in the data file. When I tried to read all of them at once, it produced a 'MemoryError'. Then I divided the data into two files and it works. Although I can work with the divided files, I'm wondering: is there any way to read the whole file at once? – PyLabour Aug 17 '15 at 03:55
- Sorry, I haven't looked into how much memory overhead `genfromtxt` has, but it wouldn't surprise me if it is a lot. – Warren Weckesser Aug 17 '15 at 03:59
- `genfromtxt` creates a list of lists (one sublist per line), and produces the array with one final call. It's written in Python, so you can check the details yourself. – hpaulj Aug 17 '15 at 04:59
`genfromtxt` accepts any iterable or generator that gives it one line at a time. So instead of giving it a file (or filename), I'd write a little generator function that reads the file and skips the lines with the wrong number of columns.

This way of using `genfromtxt` has been discussed in previous SO questions; the most recent asked how to read selected rows from a file.
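A minimal sketch of that generator idea (the 33-column count comes from the question; the file name `data.txt` is a placeholder):

import numpy as np

def filtered_lines(filename, ncols=33):
    # Yield only the lines that split into the expected number of columns.
    with open(filename) as f:
        for line in f:
            if len(line.split()) == ncols:
                yield line

data = np.genfromtxt(filtered_lines('data.txt'))

Since the bad lines never reach `genfromtxt`, no warning is raised and the first line of the file no longer needs to have the full number of columns.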
- Hi hpaulj, thanks for the answer. However, I found Warren's answer more convenient. – PyLabour Aug 17 '15 at 03:59
For the large-file aspect of this problem, you might consider using `pandas.read_table`, which lets you read files in chunks and has similar file-reading utilities. Here's the basic idea, using Warren's example file:
import pandas as pd

# Read foo.txt in chunks of three rows; short lines are padded with NaN.
data_reader = pd.read_table("foo.txt", header=None, sep=r' ', dtype=float, chunksize=3)
for chunk in data_reader:
    data = chunk.dropna()  # drop the rows that had only one column
    print(data.values)
This yields three numpy arrays:
[[ 10. 20. 30.]
[ 40. 50. 60.]]
[[ 70. 80. 90.]
[ 10. 20. 30.]]
[[ 40. 50. 60.]]
The keywords you need to pass `read_table` are a little different from the ones for `loadtxt`. For instance, here I used `sep=r' '` to fit the format of Warren's file, and I set `dtype=float` so that NaNs would be supported; that lets me use the `dropna()` method to drop the short lines. Finally, the `.values` attribute returns a `numpy.ndarray`.
There is lots of other help on SO for tweaking `read_table`, so I won't go into detail here. Hope this helps.
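If you do want a single array at the end (and the filtered data fits in memory), one option is to stack the chunks as you go. A minimal sketch, assuming the same file format; the chunk size here is just an illustrative value:

import numpy as np
import pandas as pd

reader = pd.read_table("foo.txt", header=None, sep=r' ', dtype=float, chunksize=100000)
# Drop the incomplete rows from each chunk, then stack everything into one array.
data = np.vstack([chunk.dropna().values for chunk in reader])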
