I'm trying to load a text file into a NumPy array.

The structure is the following:

THE 77534223
AND 30997177
ING 30679488
ENT 17902107
ION 17769261
HER 15277018
FOR 14686159
THA 14222073
NTH 14115952
[...]

But my attempt fails:

import numpy as np

data = np.genfromtxt("english_trigrams.txt", dtype=(str,int), delimiter=' ')                                                   
print(data)

[['TH' '77']
 ['AN' '30']
 ['IN' '30']
 ..., 
 ['JX' '1']
 ['JQ' '1']
 ['JQ' '1']]

I want an (x,2) array with dtype str in the first column and dtype int in the second.

Thanks a lot!


P.s.:

  • Python 3.6.1
  • NumPy 1.13.0
Suuuehgi
  • maybe try np.loadtxt – Cary Shindell Jul 18 '17 at 15:13
  • Possible duplicate of [How to use numpy.genfromtxt when first column is string and the remaining columns are numbers?](https://stackoverflow.com/questions/12319969/how-to-use-numpy-genfromtxt-when-first-column-is-string-and-the-remaining-column) – Maximilian Peters Jul 18 '17 at 15:19
  • `np.loadtxt("english_trigrams.txt", dtype=[('f0', '|S3'),('f1', ' – Maximilian Peters Jul 18 '17 at 15:22
  • Just out of curiosity: do you intend to change `77534223` from the file to `77`? – Marvin Taschenberger Jul 18 '17 at 15:34
  • @MaximilianPeters: I've read this post but I did not get anywhere. @MaximilianPeters: This gives me a `dim (17556,)` array. @MarvinTaschenberger: No, I did not change anything. I need to stick with `77534223` :-) – Suuuehgi Jul 18 '17 at 15:56
  • @smoneck: see here for an explanation: https://stackoverflow.com/questions/9534408/numpy-genfromtxt-produces-array-of-what-looks-like-tuples-not-a-2d-array-why – Maximilian Peters Jul 18 '17 at 16:08
  • *"I want an (x,2) array with dtype str in the first column and dtype int in the second."* That is not possible with numpy. What you *can* get is a one-dimensional structured array, using the approach suggested by @MaximilianPeters. – Warren Weckesser Jul 18 '17 at 16:24

1 Answer

Various ways of loading this text

In [470]: txt=b"""THE 77534223
     ...: AND 30997177
     ...: ING 30679488
     ...: ENT 17902107
     ...: ION 17769261
     ...: HER 15277018
     ...: FOR 14686159
     ...: THA 14222073
     ...: NTH 14115952"""

Let genfromtxt deduce the column dtypes with dtype=None:

In [471]: data = np.genfromtxt(txt.splitlines(),dtype=None)
In [472]: data
Out[472]: 
array([(b'THE', 77534223), (b'AND', 30997177), (b'ING', 30679488),
       (b'ENT', 17902107), (b'ION', 17769261), (b'HER', 15277018),
       (b'FOR', 14686159), (b'THA', 14222073), (b'NTH', 14115952)],
      dtype=[('f0', 'S3'), ('f1', '<i4')])
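
The result above is a one-dimensional structured array, so each column is reached by field name rather than by a second index. A minimal sketch of that access pattern follows. It assumes NumPy 1.14 or later, because it passes `encoding=None` (not available in the 1.13.0 from the question) so that the text column comes back as `str` instead of bytes. The variable names `lines`, `trigrams`, and `counts` are only illustrative.

```python
import numpy as np

# First three sample lines from the question, given as a list of str lines
lines = ["THE 77534223", "AND 30997177", "ING 30679488"]

# dtype=None lets genfromtxt infer one dtype per column.
# encoding=None (assumption: NumPy >= 1.14) keeps the text column as str.
data = np.genfromtxt(lines, dtype=None, encoding=None)

# Default field names are 'f0', 'f1', ...; each field is a plain 1-D array
trigrams = data["f0"]
counts = data["f1"]

print(data.shape)     # one entry per line, not (n, 2)
print(trigrams)
print(counts.sum())
```

A single record is available as `data[0]`, and `data[0]["f1"]` gives its count.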

This is not the right dtype specification; the result is like yours, a 2-D array of strings, but with just 1 char per element.

In [473]: data = np.genfromtxt(txt.splitlines(),dtype=(str, int))
In [474]: data
Out[474]: 
array([['T', '7'],
       ['A', '3'],
       ['I', '3'],
       ['E', '1'],
       ['I', '1'],
       ['H', '1'],
       ['F', '1'],
       ['T', '1'],
       ['N', '1']],
      dtype='<U1')

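A little better: this gives a structured array with the integers intact, but the string field has no length ('<U'), so every string comes out empty.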

In [475]: data = np.genfromtxt(txt.splitlines(),dtype='str,int')
In [476]: data
Out[476]: 
array([('', 77534223), ('', 30997177), ('', 30679488), ('', 17902107),
       ('', 17769261), ('', 15277018), ('', 14686159), ('', 14222073),
       ('', 14115952)],
      dtype=[('f0', '<U'), ('f1', '<i4')])

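Giving the string field an explicit length produces a result similar to the dtype=None case, with unicode strings instead of bytes: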

In [477]: data = np.genfromtxt(txt.splitlines(),dtype='U10,int')
In [478]: data
Out[478]: 
array([('THE', 77534223), ('AND', 30997177), ('ING', 30679488),
       ('ENT', 17902107), ('ION', 17769261), ('HER', 15277018),
       ('FOR', 14686159), ('THA', 14222073), ('NTH', 14115952)],
      dtype=[('f0', '<U10'), ('f1', '<i4')])
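
If an `(x, 2)` shape is really needed, one possible workaround is to convert the structured array to a 2-D array of dtype `object`, which can hold Python `str` and `int` side by side. This is only a sketch and not part of the outputs above: the field names `trigram` and `count` are made up for illustration, and an object array gives up most of NumPy's fast numeric operations, so keeping two plain arrays is often the more practical choice.

```python
import numpy as np

# First three sample lines from the question, given as a list of str lines
lines = ["THE 77534223", "AND 30997177", "ING 30679488"]

# Explicit structured dtype with illustrative (hypothetical) field names
dtype = [("trigram", "U3"), ("count", "i8")]
data = np.genfromtxt(lines, dtype=dtype)

# Option 1: two homogeneous 1-D arrays, one per column
trigrams = data["trigram"]
counts = data["count"]

# Option 2: a 2-D object array; tolist() yields tuples of Python objects
table = np.array(data.tolist(), dtype=object)

print(table.shape)
print(table[0, 0], table[0, 1])
```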
hpaulj