I am quite new to NumPy and I am trying to read a tab-delimited (\t) text file into a NumPy array using the following code:

train_data = np.genfromtxt('training.txt', dtype=None, delimiter='\t')

File contents:

38   Private    215646   HS-grad    9    Divorced    Handlers-cleaners   Not-in-family   White   Male   0   0   40   United-States   <=50K
53   Private    234721   11th   7    Married-civ-spouse  Handlers-cleaners   Husband     Black   Male   0   0   40   United-States   <=50K
30   State-gov  141297   Bachelors  13   Married-civ-spouse  Prof-specialty  Husband     Asian-Pac-Islander  Male   0   0   40   India   >50K

what I expect is a 2-D array matrix of shape (3, 15)

but with my above code I only get a single row array of shape (3,)

I am not sure why those fifteen fields of each row are not assigned a column each.

I also tried NumPy's loadtxt(), but it could not handle type conversion on my data: even though I passed dtype=None, it tried to convert the strings to the default float type and failed.

Tried code:

train_data = np.loadtxt('try.txt', dtype=None, delimiter='\t')

Error:
ValueError: could not convert string to float: State-gov

Any pointers?

Thanks

Abhi
  • Have you tried stating something like "dtype=String"? – abiessu Oct 06 '13 at 20:52
  • Oh I could resolve this using a more traditional file reads (using csv reader) – Abhi Oct 06 '13 at 21:20
  • Thanks @abiessu. dtype=np.str works fine but I would not want to convert all of them to strings. Hence I was relying on dtype=None to do the auto typecasting for me where it gives 'int' or 'float' a higher precedence over Strings when dealing with numbers – Abhi Oct 06 '13 at 21:32

3 Answers

4

The issue here is that np.genfromtxt and np.loadtxt both return a structured array if the dtype is structured (i.e., has multiple types). Your array reports a shape of (3,) because technically it is a 1-D array of 'records'. Each record holds all of your columns, so nothing is lost, and you can still pull out any column by its field name.

You are loading it correctly:

In [82]: d = np.genfromtxt('tmp',dtype=None)

As you reported, it has a 1d shape:

In [83]: d.shape
Out[83]: (3,)

But all your data is there:

In [84]: d
Out[84]: 
array([ (38, 'Private', 215646, 'HS-grad', 9, 'Divorced', 'Handlers-cleaners', 'Not-in-family', 'White', 'Male', 0, 0, 40, 'United-States', '<=50K'),
       (53, 'Private', 234721, '11th', 7, 'Married-civ-spouse', 'Handlers-cleaners', 'Husband', 'Black', 'Male', 0, 0, 40, 'United-States', '<=50K'),
       (30, 'State-gov', 141297, 'Bachelors', 13, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'Asian-Pac-Islander', 'Male', 0, 0, 40, 'India', '>50K')], 
      dtype=[('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])

The dtype of the array is structured, like so:

In [85]: d.dtype
Out[85]: dtype([('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])

And you can still access "columns" (known as fields) using the names given in the dtype:

In [86]: d['f0']
Out[86]: array([38, 53, 30])

In [87]: d['f1']
Out[87]: 
array(['Private', 'Private', 'State-gov'], 
      dtype='|S9')
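
For readers who want to reproduce this without the original file, here is a minimal sketch. It assumes Python 3 and a reasonably recent NumPy. The in-memory io.StringIO buffer stands in for the file, and encoding="utf-8" is an addition not present in the session above, so text fields come back as str rather than bytes.

```python
import io
import numpy as np

# The three rows from the question, rebuilt with real tab separators.
rows = [
    ["38", "Private", "215646", "HS-grad", "9", "Divorced", "Handlers-cleaners",
     "Not-in-family", "White", "Male", "0", "0", "40", "United-States", "<=50K"],
    ["53", "Private", "234721", "11th", "7", "Married-civ-spouse", "Handlers-cleaners",
     "Husband", "Black", "Male", "0", "0", "40", "United-States", "<=50K"],
    ["30", "State-gov", "141297", "Bachelors", "13", "Married-civ-spouse", "Prof-specialty",
     "Husband", "Asian-Pac-Islander", "Male", "0", "0", "40", "India", ">50K"],
]
buf = io.StringIO("\n".join("\t".join(r) for r in rows))

# dtype=None asks genfromtxt to infer a type for each column separately.
d = np.genfromtxt(buf, dtype=None, delimiter="\t", encoding="utf-8")

print(d.shape)             # 1-D: one record per row
print(len(d.dtype.names))  # number of fields in each record
print(d["f0"])             # first field, inferred as integers
print(d[1]["f3"])          # a single value: row 1, field 'f3'
```

Indexing with a row number and then a field name (or the other way round) is the structured-array counterpart of d[row, col] on a plain 2-D array.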

It's more convenient to give proper names to the fields:

In [104]: names = "age,military,id,edu,a,marital,job,fam,ethnicity,gender,b,c,d,country,income"

In [105]: d = np.genfromtxt('tmp',dtype=None, names=names)

So you can now access the 'age' field, etc.:

In [106]: d['age']
Out[106]: array([38, 53, 30])

In [107]: d['income']
Out[107]: 
array(['<=50K', '<=50K', '>50K'], 
      dtype='|S5')

Or the incomes of people under 35

In [108]: d[d['age'] < 35]['income']
Out[108]: 
array(['>50K'], 
      dtype='|S5')

and over 35

In [109]: d[d['age'] > 35]['income']
Out[109]: 
array(['<=50K', '<=50K'], 
      dtype='|S5')
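
The named-field filtering above can be tried in isolation with a small sketch. The three-column sample here (age, education, income) is a trimmed version of the question's rows chosen for brevity, and encoding="utf-8" is an assumption for Python 3 so the strings are not returned as bytes.

```python
import io
import numpy as np

# Trimmed, three-column version of the sample rows (age, education, income).
buf = io.StringIO("38\tHS-grad\t<=50K\n53\t11th\t<=50K\n30\tBachelors\t>50K\n")

d = np.genfromtxt(buf, dtype=None, delimiter="\t",
                  names="age,edu,income", encoding="utf-8")

under_35 = d[d["age"] < 35]        # boolean mask selects whole records
print(under_35["income"])
print(d["edu"][d["age"] > 35])     # index the field first, then apply the mask
```

Masking the records and then picking a field, or picking the field and then masking, gives the same values; the second form avoids copying fields you do not need.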
askewchan
2

Updated answer

Sorry, I misread your original question:

what I expect is a 2-D array matrix of shape (3, 15)

but with my above code I only get a single row array of shape (3,)

I think you misunderstand what np.genfromtxt() will return. In this case, it will try to infer the type of each 'column' in your text file and give you back a structured / "record" array. Each row will contain multiple fields (f0...f14), each of which can contain values of a different type corresponding to a 'column' in your text file. You can index a particular field by name, e.g. data['f0'].

You simply can't have a (3,15) numpy array of heterogeneous types. You can have a (3,15) homogeneous array of strings, for example:

>>> string_data = np.genfromtxt('test', dtype=str, delimiter='\t')
>>> print(string_data.shape)
(3, 15)

Then of course you could manually cast the columns to whatever type you want, as in @DrRobotNinja's answer. However, you might as well let numpy create a structured array for you, then index it by field and assign the columns to new arrays.
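
One way to do that last step is sketched below. It is only an illustration: the four-column sample is a trimmed version of the question's data, encoding="utf-8" assumes Python 3, and selecting fields by dtype kind is one possible approach rather than the only one.

```python
import io
import numpy as np

# Trimmed sample: age, work class, id, years of education.
buf = io.StringIO(
    "38\tPrivate\t215646\t9\n"
    "53\tPrivate\t234721\t7\n"
    "30\tState-gov\t141297\t13\n"
)
d = np.genfromtxt(buf, dtype=None, delimiter="\t", encoding="utf-8")

# Keep only the fields that were inferred as (signed or unsigned) integers.
int_fields = [name for name in d.dtype.names if d.dtype[name].kind in "iu"]

# Stack those fields side by side into an ordinary 2-D integer array.
numeric = np.column_stack([d[name] for name in int_fields])

print(int_fields)
print(numeric.shape)
```

The result is a regular homogeneous array, so the usual 2-D slicing such as numeric[:, 0] works on it, while the string fields stay available in the structured array.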

ali_m
1

I do not believe NumPy arrays handle different datatypes within a single array. What you can do is load the entire array as strings, then convert the relevant columns to numbers as needed:

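import numpy as np

# Load every field as a string (np.str was removed from NumPy; use the builtin str)
train_data = np.loadtxt('try.txt', dtype=str, delimiter='\t')

# Convert the numeric string columns into integers (np.int was removed too)
first_col = train_data[:,0].astype(int)
third_col = train_data[:,2].astype(int)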
DrRobotNinja
  • Actually numpy arrays can have a structured dtype, see my answer. – askewchan Oct 07 '13 at 20:52
  • @askewchan the problem is that numpy does not seem able to build a two-dimensional array from different field types, so DrRobotNinja's answer is quite helpful – ngeek Sep 08 '17 at 15:52
  • @ngeek I encourage you to learn about how numpy deals with different field types in a single array, rather than trying to outsmart it by converting all numbers to strings. It can be quite useful! See the other two answers here. – askewchan Sep 08 '17 at 16:15