My question originates from this answer by Phil. The code is:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,31,2.5,1260759144], [1,1029,3,1260759179],
                    [1,1061,3,1260759182],[1,1129,2,1260759185],
                    [1,1172,4,1260759205],[2,31,3,1260759134],
                    [2,1111,4.5,1260759256]],
                   index=['a','c','h','g','e','b','f'],
                   columns=['userId','movieId','rating','timestamp'])
df.index.names=['ID No.']
df.columns.names=['Information']

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_array())

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """
    v = df.values
    cols = df.columns
    # df[k].dtype.type is <class 'numpy.object_'> for the 'ID No.' column;
    # I want to convert it to a string type instead.
    types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z
sa = df_to_sarray(df.reset_index())
print(sa)

Phil's answer works well, but if I run

sa = df_to_sarray(df.reset_index())

I get the following result:

array([('a', 1, 31, 2.5, 1260759144), ('c', 1, 1029, 3.0, 1260759179),
       ('h', 1, 1061, 3.0, 1260759182), ('g', 1, 1129, 2.0, 1260759185),
       ('e', 1, 1172, 4.0, 1260759205), ('b', 2, 31, 3.0, 1260759134),
       ('f', 2, 1111, 4.5, 1260759256)], 
      dtype=[('ID No.', 'O'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')])

I would like to get the following dtype instead:

dtype=[('ID No.', 'S'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')]

That is, a string type for 'ID No.' instead of object.

I checked df[k].dtype.type and found it is <class 'numpy.object_'>. I want to convert it to a NumPy string type (numpy.str_). How can I do that?

ayhan
Renke
  • Have you tried ```df[col].astype(str)```? – Quentin Jun 21 '17 at 01:25
  • `types` is a list, so you should be able to change the first tuple, which presumably is `('ID No.', 'O')`. – hpaulj Jun 21 '17 at 01:27
  • I only want to convert the 'object' type to 'string'; columns of type 'int' should stay 'int'. – Renke Jun 21 '17 at 01:28
  • Yes, I tried using [cols[i], df[k].dtype.type] and got the following result from print(types): [['ID No.', <class 'numpy.object_'>], ['userId', <class 'numpy.int64'>], ['movieId', <class 'numpy.int64'>], ['rating', <class 'numpy.float64'>], ['timestamp', <class 'numpy.int64'>]] – Renke Jun 21 '17 at 01:31

1 Answer

After reset_index the dtypes of your dataframe are a mix of object and numeric types. The former index is now a column with object dtype, not a string dtype.

In [9]: df1=df.reset_index()
In [10]: df1.dtypes
Out[10]: 
Information
ID No.        object
userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

df1.values is a (7,5) object dtype array.
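The same effect can be seen on a smaller frame. This is only a sketch: the two-row frame `small` is a made-up stand-in for the question's data, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, only to illustrate the dtype behaviour.
small = pd.DataFrame({'userId': [1, 2], 'rating': [2.5, 3.0]},
                     index=pd.Index(['a', 'b'], name='ID No.')).reset_index()

# Each numeric column keeps its own dtype ...
print(small['userId'].dtype, small['rating'].dtype)

# ... but a single 2-D array needs one common dtype, which here is object.
values = small.values
print(values.shape, values.dtype)
```

That is why the per-column dtypes, not the dtype of .values, are the useful source for building the structured dtype.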

With the correct dtype, your approach works nicely (I'm using 'U2' on Py3):

In [31]: v = df1.values
In [32]: dt1=np.dtype([('ID No.', 'U2'), ('userId', '<i8'), ('movieId', '<i8'), 
    ...: ('rating', '<f8'), ('timestamp', '<i8')])
In [33]: z = np.zeros(v.shape[0], dtype=dt1)
In [34]: 
In [34]: for i,k in enumerate(dt1.names):
    ...:     z[k] = v[:, i]
    ...:     
In [35]: z
Out[35]: 
array([('a', 1,   31,  2.5, 1260759144), ('c', 1, 1029,  3. , 1260759179),
       ('h', 1, 1061,  3. , 1260759182), ('g', 1, 1129,  2. , 1260759185),
       ('e', 1, 1172,  4. , 1260759205), ('b', 2,   31,  3. , 1260759134),
       ('f', 2, 1111,  4.5, 1260759256)], 
      dtype=[('ID No.', '<U2'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')])

So the trick is to derive that dt1 from the dataframe.

Editing types after construction is one option:

In [36]: cols=df1.columns
In [37]: types = [(cols[i], df1[k].dtype.type) for (i, k) in enumerate(cols)]
In [38]: types
Out[38]: 
[('ID No.', numpy.object_),
 ('userId', numpy.int64),
 ('movieId', numpy.int64),
 ('rating', numpy.float64),
 ('timestamp', numpy.int64)]
In [39]: types[0]=(types[0][0], 'U2')
In [40]: types
Out[40]: 
[('ID No.', 'U2'),
 ('userId', numpy.int64),
 ('movieId', numpy.int64),
 ('rating', numpy.float64),
 ('timestamp', numpy.int64)]
In [41]: 
In [41]: z = np.zeros(v.shape[0], dtype=types)

Tweaking the column dtype during construction also works:

def foo(atype):
    if atype==np.object_:
        return 'U2'
    return atype
In [59]: types = [(cols[i], foo(df1[k].dtype.type)) for (i, k) in enumerate(cols)]

In either case we have to know ahead of time that we want to turn the object column into a specific string type, and not something more generic.
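If hard-coding 'U2' is the sticking point, one possibility is to measure the longest string in each object column and build the width from that. The following is a minimal sketch under assumptions: the helper names `str_dtype_for` and `df_to_sarray_str` are invented for illustration, every column is assumed to be either a plain NumPy numeric dtype or an object/string column, and the three-row frame is a cut-down stand-in for the question's data.

```python
import numpy as np
import pandas as pd

def str_dtype_for(series):
    """Return a fixed-width unicode dtype for object/string columns,
    and the column's own dtype otherwise (hypothetical helper)."""
    if pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series):
        lengths = series.astype(str).map(len)
        # Fall back to width 1 for an empty column.
        width = int(lengths.max()) if len(lengths) else 1
        return np.dtype(f'U{max(width, 1)}')
    return series.dtype

def df_to_sarray_str(df):
    """Variant of df_to_sarray that stores object columns as strings."""
    types = [(str(name), str_dtype_for(df[name])) for name in df.columns]
    z = np.zeros(len(df), dtype=types)
    for name in df.columns:
        # Copy column by column so each field keeps its own dtype.
        z[str(name)] = df[name].to_numpy()
    return z

# Cut-down stand-in for the question's frame.
demo = pd.DataFrame({'userId': [1, 1, 2], 'rating': [2.5, 3.0, 4.5]},
                    index=pd.Index(['a', 'c', 'b'], name='ID No.'))
sa = df_to_sarray_str(demo.reset_index())
print(sa.dtype)
```

The width is taken from the data at conversion time, so longer strings assigned into the array later would be truncated to that width.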

I don't know enough pandas to say whether it's possible to change the dtype of that ID column before we extract an array. .values will be an object dtype array because of the mix of column dtypes.
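One pandas-side route that may help is DataFrame.to_records, which builds a record array directly from the frame. The sketch below assumes a pandas version that accepts the index_dtypes argument (0.24 or later, as far as I know; worth checking against your install). The frame is a cut-down stand-in for the question's data.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'userId': [1, 1, 2], 'rating': [2.5, 3.0, 4.5]},
                     index=pd.Index(['a', 'c', 'b'], name='ID No.'))

# Width of the widest index label, so nothing gets truncated.
width = int(frame.index.str.len().max())

# Ask pandas to store the index field as fixed-width unicode instead of object.
rec = frame.to_records(index_dtypes=f'U{width}')
print(rec.dtype)
```

Note that the result is a np.recarray rather than a plain structured ndarray, and the index is kept as the first field without needing reset_index.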

hpaulj