
Despite the advice from the previous questions:

-9999 as missing value with numpy.genfromtxt()

Using genfromtxt to import csv data with missing values in numpy

I am still unable to process a text file that ends with a missing value.

a.txt:

1 2 3
4 5 6
7 8

I've tried multiple combinations of the missing_values and filling_values options and cannot get this to work:

import numpy as np

sol = np.genfromtxt("a.txt", 
                    dtype=float,
                    invalid_raise=False, 
                    missing_values=None,
                    usemask=True,
                    filling_values=0.0)
print(sol)

What I would like to get is:

[[1.0 2.0 3.0]
 [4.0 5.0 6.0]
 [7.0 8.0 0.0]]

but instead I get:

/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py:1641: ConversionWarning: Some errors were detected !
    Line #3 (got 2 columns instead of 3)
  warnings.warn(errmsg, ConversionWarning)
[[1.0 2.0 3.0]
 [4.0 5.0 6.0]]
Hooked
  • Would it be possible to change the delimiter? – Daniel Jun 25 '13 at 21:06
  • @Ophion No, presume that the text file is fixed as is. I can certainly load the file with normal python with a few `str.split`'s but the question is how to do the same with `numpy.genfromtxt`. – Hooked Jun 25 '13 at 21:14
  • Would you be interested in a solution using pandas? (It's dead simple). – unutbu Jun 25 '13 at 21:24
  • Numpy's IOtools uses `line.split(delimiter)`. I'm not sure there is a way around it unless the columns are a fixed number of characters across (a sketch of that idea follows these comments). As mentioned, pandas is really great; my life became much simpler once I made the jump. – Daniel Jun 25 '13 at 21:30
  • From the [docs](http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html): "When spaces are used as delimiters, or when no delimiter has been given as input, there should not be any missing data between two fields." You simply can't do what you want. – Gerrat Jun 25 '13 at 21:32
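
Following up on Daniel's fixed-width idea (a sketch only, not from any of the answers): if every field really does occupy a fixed two-character slot, which a.txt happens to satisfy, genfromtxt accepts a sequence of integer field widths as the delimiter, and then the short last line still splits into three fields, the last one empty:

import numpy as np

# Sketch only: assumes fixed two-character field widths in a.txt.
# The empty trailing slot counts as a missing field and takes filling_values.
sol = np.genfromtxt("a.txt", delimiter=(2, 2, 2), dtype=float,
                    filling_values=0.0)
print(sol)
# should give:
# [[ 1.  2.  3.]
#  [ 4.  5.  6.]
#  [ 7.  8.  0.]]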

3 Answers


Using pandas:

import pandas as pd

df = pd.read_table('a.txt', sep=r'\s+', header=None)
df.fillna(0, inplace=True)
print(df)
#    0  1  2
# 0  1  2  3
# 1  4  5  6
# 2  7  8  0

pandas.read_table replaces missing data with NaNs. You can replace those NaNs with some other value using df.fillna.

df is a pandas.DataFrame. You can access the underlying NumPy array with df.values:

print(df.values)
# [[ 1.  2.  3.]
#  [ 4.  5.  6.]
#  [ 7.  8.  0.]]
unutbu
  • You can add the `dtype=float` keyword to `pd.read_table` to get the data type he wants (a sketch follows these comments)... +1 though. – dawg Jun 25 '13 at 21:42
  • I appreciate the answer and will look into pandas for the future. The question dealt specifically with `genfromtxt` and by extension `numpy`, so I'm accepting the other answer on that basis. – Hooked Jun 25 '13 at 23:42
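
A minimal sketch of dawg's suggestion, assuming the a.txt file from the question and a pandas version whose Python parser accepts dtype; forcing dtype=float makes every column a float from the start:

import pandas as pd

# Sketch of dawg's comment: dtype=float yields float columns up front.
df = pd.read_table('a.txt', sep=r'\s+', header=None, dtype=float)
print(df.fillna(0.0).values)
# [[ 1.  2.  3.]
#  [ 4.  5.  6.]
#  [ 7.  8.  0.]]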

The issue is that numpy doesn't like ragged arrays. Since there is no character in the third position of the last row of the file, genfromtxt doesn't even know there is anything to parse there, let alone what to do with it. If the missing value had a filler (any filler) such as:

1 2 3
4 5 6
7 8 ''

Then you'd be able to:

sol = np.genfromtxt("a.txt",
                    dtype=float,
                    invalid_raise=False,
                    missing_values='',
                    usemask=False,
                    filling_values=0.0)

and then `sol` would be:

array([[  1.,   2.,   3.],
       [  4.,   5.,   6.],
       [  7.,   8.,  nan]])
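
If the trailing nan is unwanted, NumPy's nan_to_num (which maps nan to 0.0) would finish the job and match the output the question asks for:

sol = np.nan_to_num(sol)  # nan -> 0.0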

Unfortunately, if making the columns of the file uniform isn't an option, you might be stuck with line-by-line parsing.

One other possibility, if all the "short" rows are at the end of the file, is to use the `usecols` flag to parse the columns that are present in every row, and then the `skip_footer` flag to parse the remaining columns while skipping the rows where they are missing:

sol = np.genfromtxt("a.txt",
                    dtype=float,
                    invalid_raise=False,
                    usemask=False,
                    filling_values=0.0,
                    usecols=(0,1))
sol
array([[ 1.,  2.],
       [ 4.,  5.],
       [ 7.,  8.]])

sol2 = np.genfromtxt("a.txt",
                     dtype=float,
                     invalid_raise=False,
                     usemask=False,
                     filling_values=0.0,
                     usecols=(2,),
                     skip_footer=1)
sol2
array([ 3.,  6.])

And then combine the arrays from there adding the fill value:

sol2 = np.append(sol2, 0.0)
sol2 = sol2.reshape(3, 1)
sol = np.hstack([sol, sol2])
sol
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  0.]])
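
For reuse, the two-pass idea can be wrapped in a small helper. load_padded is a hypothetical name, and this sketch assumes exactly one short trailing row that is missing only the last column, as in a.txt:

import numpy as np

def load_padded(fname, ncols, fill=0.0):
    # Hypothetical helper wrapping the usecols/skip_footer approach:
    # first parse the columns present in every row, then the last
    # column while skipping the single short trailing row.
    left = np.genfromtxt(fname, dtype=float, usecols=range(ncols - 1))
    last = np.genfromtxt(fname, dtype=float, invalid_raise=False,
                         usecols=(ncols - 1,), skip_footer=1)
    last = np.append(last, fill).reshape(-1, 1)
    return np.hstack([left, last])

print(load_padded("a.txt", 3))
# [[ 1.  2.  3.]
#  [ 4.  5.  6.]
#  [ 7.  8.  0.]]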
blazetopher
  • Thanks, I didn't think about the `usecols` solution. I should generally know ahead of time which column is going to be missing, and the bad row will always be at the end. – Hooked Jun 25 '13 at 23:39

In my experience the best option is to parse manually. This function works for me; it might be slow, but it is generally fast enough.

import numpy as np

def manual_parsing(filename, delim, dtype):
    out = list()
    lengths = list()
    with open(filename, 'r') as ins:
        for line in ins:
            # Split each line ourselves instead of relying on genfromtxt.
            l = line.strip().split(delim)
            out.append(l)
            lengths.append(len(l))
    lim = np.max(lengths)
    # Pad every short row with "nan" so the array comes out rectangular.
    for l in out:
        while len(l) < lim:
            l.append("nan")
    return np.array(out, dtype=dtype)
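
A quick usage check on a.txt, assuming whitespace-delimited input: passing delim=None makes str.split break on any run of whitespace, and np.nan_to_num swaps the "nan" padding for 0.0:

sol = manual_parsing("a.txt", None, float)
print(np.nan_to_num(sol))
# [[ 1.  2.  3.]
#  [ 4.  5.  6.]
#  [ 7.  8.  0.]]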