
I am importing a 153,673 × 25 CSV data matrix containing integers, floats and strings using pandas, through the IPython console in Anaconda's Spyder (Python 2). I then want to convert this data into a NumPy structured array, taking the field names from the DataFrame's column names and specifying the types manually. Here is the code. The functions importing_data.run() and attributes_names.run() respectively import the CSV data as a pandas DataFrame and extract the DataFrame's column names as a list:

import pandas
import numpy
import importing_data
import attributes_names

csv_data    = importing_data.run()
names       = attributes_names.run(csv_data)

type_list   = ['int',
               'str',
               'str',
                ...
               'float',
               'int',
               'int',
              ]

data_type   = zip(names,type_list)

n_rows      = len(csv_data.ix[:,0])
n_columns   = len(csv_data.ix[0,:])
data_sample = numpy.zeros((n_rows,n_columns),dtype=data_type)

for i in range(0,n_columns):
    column              = csv_data.ix[:,i].values
    data_sample[:,i]    = column

However, the final loop seems to be failing: it sometimes forces the kernel to restart, and when it doesn't, the data_sample array has an unexpected structure. I can't describe it precisely because lately I've only had kernel restarts, but I believe it was a 153,673 × 25 array whose elements were each 153,673-element lists.

What am I doing wrong here?


Edit

A first mistake I was making is the following: instead of

data_sample = numpy.zeros((n_rows,n_columns),dtype=data_type)

I have to put:

data_sample = numpy.zeros((n_rows,1),dtype=data_type)

I have redefined the loop as follows:

for i in range(0,n_rows):
    data_sample[i,0] = csv_data.values[i,:]

But now I get the following error message: TypeError: expected a single-segment buffer object

Daneel Olivaw
  • It's not very clear what you are trying to achieve... Can you provide a small sample data set (3-5 rows) and the desired data set? Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – MaxU - stand with Ukraine Feb 05 '17 at 14:27
  • Try initiating `data_sample` to `(nrows,)`; and do `data_sample[row]=tuple(csvdata...)`. – hpaulj Feb 05 '17 at 16:13

1 Answer

Reconstructing your problem without all the pandas complications:

In [695]: names=['a','b','c']
In [696]: type_list=['int','float','int']
In [697]: datatype=list(zip(names,type_list))
In [698]: dt = np.dtype(datatype)
In [699]: dt
Out[699]: dtype([('a', '<i4'), ('b', '<f8'), ('c', '<i4')])

Make a data array like csv_data.values. Since you are expecting strings and numbers, I suspect this is an object dtype array (pandas resorts to that dtype quite often):

In [712]: data = np.arange(12).reshape(4,3).astype(object)
In [713]: data
Out[713]: 
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8],
       [9, 10, 11]], dtype=object)

Create the target structured array. Note that it is 1d, with 4 elements (rows/records), each with 3 fields (from the dtype):

In [714]: A = np.zeros((4,), dtype=dt)
In [715]: A
Out[715]: 
array([(0,  0., 0), (0,  0., 0), (0,  0., 0), (0,  0., 0)], 
      dtype=[('a', '<i4'), ('b', '<f8'), ('c', '<i4')])
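To see why the shape matters, here is a minimal sketch comparing a 1d structured array with a 2d one of the kind the question built with (n_rows, n_columns). The names dt3, one_d and two_d are only for illustration. In the 2d case every element is a full record, so the array holds a complete set of fields per cell rather than per row.

```python
import numpy as np

# Same three-field dtype as above, rebuilt here so the block is self-contained
dt3 = np.dtype([('a', 'i4'), ('b', 'f8'), ('c', 'i4')])

one_d = np.zeros((4,), dtype=dt3)    # 4 records, each with fields a, b, c
two_d = np.zeros((4, 3), dtype=dt3)  # 12 records, each with fields a, b, c

# Selecting a field keeps the array's shape
print(one_d['a'].shape)  # one value of 'a' per row
print(two_d['a'].shape)  # one value of 'a' per (row, column) cell

# The 2d version uses three times the memory for the same number of rows
print(one_d.nbytes, two_d.nbytes)
```

With 25 fields and a (n_rows, 25) shape, the same effect would multiply the memory use by 25, which may be related to the kernel restarts, though that is a guess.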

The input to a structured array should be a tuple or a list of tuples:

In [716]: for i in range(4):
     ...:     A[i] = tuple(data[i,:])

In [717]: A
Out[717]: 
array([(0,   1.,  2), (3,   4.,  5), (6,   7.,  8), (9,  10., 11)], 
      dtype=[('a', '<i4'), ('b', '<f8'), ('c', '<i4')])

Assigning the row without the tuple() conversion (data[i,:] is an array, not a tuple) runs without error here, but stores unexpected values. I suspect it is doing byte copies, without paying attention to the dtype:

In [718]: for i in range(4):
     ...:     A[i] = data[i,:]

In [719]: A
Out[719]: 
array([(139402288,   1.17777468e-268, 0),
       (139402336,   1.17780241e-268, 0),
       (139402384,   1.17783014e-268, 0), (139402432,   1.17785787e-268, 0)], 
      dtype=[('a', '<i4'), ('b', '<f8'), ('c', '<i4')])

I could also create A directly, if the data is a list of tuples:

In [720]: d = [tuple(r) for r in data]
In [721]: d
Out[721]: [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]
In [722]: A=np.array(d, dtype=dt)
In [723]: A
Out[723]: 
array([(0,   1.,  2), (3,   4.,  5), (6,   7.,  8), (9,  10., 11)], 
      dtype=[('a', '<i4'), ('b', '<f8'), ('c', '<i4')])
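The same list-of-tuples approach can be sketched for mixed data that includes strings, which is closer to the question's case. The rows, the field names and the string width 'U10' below are all made up for illustration. I give the string field an explicit width on the assumption that a bare 'str' may not reserve enough room for the text.

```python
import numpy as np

# Hypothetical mixed-type rows standing in for csv_data.values
rows = np.array([[1, 'foo', 2.5],
                 [2, 'bar', 3.5]], dtype=object)

# Explicit string width ('U10') is an assumption; choose one that fits the data
dt_mixed = np.dtype([('id', 'i8'), ('label', 'U10'), ('score', 'f8')])

# Each row becomes a tuple, and each tuple becomes one record
records = np.array([tuple(r) for r in rows], dtype=dt_mixed)

print(records.shape)
print(records['label'])
```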

You can also assign values by field name. This is often faster, since there are usually many more rows than fields, so the loop runs far fewer times:

In [725]: for i,n in enumerate(dt.names):
     ...:     print(i,n)
     ...:     A[n] = data[:,i]
     ...:     
0 a
1 b
2 c
In [726]: A
Out[726]: 
array([(0,   1.,  2), (3,   4.,  5), (6,   7.,  8), (9,  10., 11)], 
      dtype=[('a', '<i4'), ('b', '<f8'), ('c', '<i4')])
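Applied to a DataFrame, the field-by-field assignment might look like the sketch below. The small frame, its column names and the 'U8' string width are hypothetical stand-ins for the real CSV data. The last line uses DataFrame.to_records, which lets pandas choose the field types itself instead of a hand-written type list.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the imported CSV data
df = pd.DataFrame({'id': [1, 2, 3],
                   'label': ['x', 'y', 'z'],
                   'score': [0.5, 1.5, 2.5]})

# Assumed target dtype; the string width 'U8' is a guess for this toy data
dt_df = np.dtype([('id', 'i8'), ('label', 'U8'), ('score', 'f8')])

out = np.zeros(len(df), dtype=dt_df)  # 1d: one record per row
for name in dt_df.names:
    out[name] = df[name].to_numpy()   # one assignment per column

# Alternative: let pandas build a record array with inferred field types
rec = df.to_records(index=False)

print(out)
print(rec.dtype.names)
```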
hpaulj