I have a DataFrame with a column that contains NumPy arrays.

I saved it to a CSV file.

After loading the CSV file, I found that the column that contained the NumPy arrays now holds strings.

How can I convert it back to NumPy arrays when using read_csv?

import pandas as pd
import numpy as np

df = pd.DataFrame(columns = ['name', 'sex'])
df.loc[len(df), :] = ['Sam', 'M']
df.loc[len(df), :] = ['Mary', 'F']
df.loc[len(df), :] = ['Ann', 'F']

# insert np.array objects (assigning the whole column at once avoids chained assignment)
df['data'] = [np.array([2, 5, 7]), np.array([6, 4, 8]), np.array([9, 2, 1])]

# save to csv file
df.to_csv('data.csv', index=False)
# load csv file
df2 = pd.read_csv('data.csv')  # data column becomes string; how to change it to np.array?
Chan
  • What version of Python are you using? – aydow Aug 06 '18 at 06:55
  • How do you want to save the NumPy array in the CSV file, between other fields? – 9769953 Aug 06 '18 at 06:55
  • @aydow Python 3.6.4 – Chan Aug 06 '18 at 06:57
  • Use numpy.fromstring(text, sep=' '). Define the separator as well. – Upasana Mittal Aug 06 '18 at 06:58
  • @9769953 Sorry, I don't understand what you mean. – Chan Aug 06 '18 at 06:58
  • This might help: https://stackoverflow.com/questions/3518778/how-do-i-read-csv-data-into-a-record-array-in-numpy – anky Aug 06 '18 at 07:16
  • Your NumPy array is a single cell in your DataFrame. Writing that out means it's a single cell in the CSV file, which doesn't suit an array very well. You can try to represent the array as `...,[1 2 3],...`, but when read back in, that is a single cell with the string "[1 2 3]". You need a proper reader that transforms such strings into a NumPy array (which could also be done after reading), or write out the array as e.g. `...,1,2,3,...`. But the latter only works if each array has the same length. – 9769953 Aug 06 '18 at 07:25
  • After reading the file you can do something like this: `df2['data'] = [np.array(i) for i in df2.data]`. I don't think it's possible while reading the file. – shivsn Aug 06 '18 at 07:30
  • @shivsn It returns `array('[2 5 7]', dtype=' – Chan Aug 06 '18 at 07:34
  • @anky_91 I wrote `data2 = genfromtxt('data.csv', delimiter=',')`. It returns `array([[nan, nan, nan], [nan, nan, nan], [nan, nan, nan], [nan, nan, nan]])` – Chan Aug 06 '18 at 07:36
  • @shivsn Please note that `array('[2 5 7]', dtype=' – Chan Aug 06 '18 at 10:40
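
The second option from the comments (writing the array out as `...,1,2,3,...`) could be sketched roughly as below. This is only an illustration and assumes every array has the same length. The column names `data_0`, `data_1`, `data_2` are made up for the example, and an in-memory buffer stands in for `data.csv`.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Sam', 'Mary', 'Ann'], 'sex': ['M', 'F', 'F']})
arrays = [np.array([2, 5, 7]), np.array([6, 4, 8]), np.array([9, 2, 1])]

# Stack the equal-length arrays into a 2-D block and give each element its own column.
# The data_<i> names are hypothetical, chosen only for this sketch.
values = np.vstack(arrays)
for i in range(values.shape[1]):
    df[f'data_{i}'] = values[:, i]

# Round-trip through CSV; the buffer stands in for a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df2 = pd.read_csv(buffer)

# Collect the element columns back into one 2-D NumPy array (one row per record).
data_cols = [c for c in df2.columns if c.startswith('data_')]
restored = df2[data_cols].to_numpy()
```

Each row of `restored` is then the array for the matching row of `df2`, and the values come back as numbers rather than strings.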

2 Answers

It's a workaround that parses the strings after loading:

In [114]: df2['data'] = df2.data.str.split(' ',expand=True).replace(r'\[|\]','',regex=True).astype(int).values.tolist()

In [115]: df2['data'] = [np.array(i) for i in df2.data]

In [116]: df2.loc[0,'data']
Out[116]: array([2, 5, 7])
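
If you would rather do the conversion while reading, one possibility is the `converters` argument of `read_csv`, which applies a function to each raw cell of a column. A minimal sketch, assuming the cells look like `[2 5 7]` and hold integers; `parse_array` is a helper name invented for this example, and an in-memory string stands in for `data.csv`:

```python
import io

import numpy as np
import pandas as pd


def parse_array(text):
    # Hypothetical helper: strip the brackets, split on whitespace, convert to int.
    return np.array([int(token) for token in text.strip('[]').split()], dtype=int)


# Stand-in for data.csv, with the arrays written as "[2 5 7]" style strings.
csv_text = "name,sex,data\nSam,M,[2 5 7]\nMary,F,[6 4 8]\nAnn,F,[9 2 1]\n"

# converters maps a column name to a function applied to each cell's raw string.
df2 = pd.read_csv(io.StringIO(csv_text), converters={'data': parse_array})

first = df2.loc[0, 'data']
```

Because the helper splits on any whitespace, it does not depend on the arrays having the same length.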
shivsn

Pandas has only seven main data types: object, float, int, bool, datetime, timedelta and category. Lists, strings, arrays and so on are all stored under the object dtype. You can read more about it at http://pbpython.com/pandas_dtypes.html. The astype function only converts between these dtypes, so it cannot turn a column of strings back into NumPy arrays.
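
A small sketch of what that means for the question's data, using an in-memory buffer in place of `data.csv` (an assumption made only to keep the example self-contained): the column has the object dtype before saving, and each cell comes back as a plain string after loading.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Sam', 'Mary']})
df['data'] = pd.Series([np.array([2, 5, 7]), np.array([6, 4, 8])], index=df.index)

# Before saving: the column is object dtype and each cell is a real ndarray.
dtype_before = df['data'].dtype
cell_before = df.loc[0, 'data']

# Round-trip through CSV text.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df2 = pd.read_csv(buffer)

# After loading: each cell is the text that was written, not an array.
cell_after = df2.loc[0, 'data']
```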

Sreekiran A R