I have a PostgreSQL database with a table whose columns look like: date, character varying, character varying, integer[]. The integer array column stores a list of values, e.g. 1,2,3,4,5. I'm using pd.read_sql to read the data into a dataframe.
So I have a dataframe with a date column, several string columns, and then a column containing lists of integers.
The array values are regularly used in numpy arrays to do vector math.
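For reference, the read itself is nothing fancy; a minimal sketch of what I'm doing is below (the connection string, table, and column names are just placeholders, not my real schema). In my setup the driver returns the integer[] column as a Python list per row, so the dataframe column ends up as dtype object holding lists.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection details and table/column names, just to show the read.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")

# The integer[] column comes back as a Python list in each row,
# so 'measures' is an object-dtype column of lists.
df = pd.read_sql("SELECT obs_date, description, measures FROM observations", engine)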
In the past I couldn't find a way to convert the list column to a numpy array column without looping and recreating the dataframe row by row. As an example:
import pandas as pd
import numpy as np
col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
new_df = pd.DataFrame(columns=df.columns)
for i in range(len(df)):
    new_df.loc[i, ['Description', 'Measures']] = [df.at[i, 'Description'], np.array(df.at[i, 'Measures'])]
print(new_df)
This looping could be over a few thousand rows.
More recently I figured out that I could do a single-line conversion of Series -> list -> np.array -> list -> Series and achieve the result much more efficiently.
import pandas as pd
import numpy as np
col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
df['NParray'] = pd.Series(list(np.array(list(np.array(df['Measures'])))))
df.drop(['Measures'], axis=1, inplace=True)
print(df)
print(type(df['NParray'][0]))
I read about and tried to use Series.array and Series.to_numpy, but they don't really achieve what I'm trying to do.
So, the question is: Is there a method to convert a pd.Series of lists to a numpy array as I'm trying to do? Is there any easier way to mass convert these lists to numpy arrays?
I was hoping for something simple like:
df['NParray'] = np.asarray(df['Measures'])
df['NParray'] = np.array(df['Measures'])
df['NParray'] = df['Measures'].array
df['NParray'] = df['Measures'].to_numpy()
But these do something different and do not produce what I'm after.
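To show what I'm seeing, here's a quick check on a Series of lists like the one above: to_numpy() (and np.asarray()) hand back an object-dtype array whose elements are still plain Python lists, and .array just wraps those same object values, so nothing per-row ever becomes a numpy array.
s = pd.Series([[1, 2, 3, 4, 5]] * 4)
print(type(s.to_numpy()))       # <class 'numpy.ndarray'>, but dtype=object
print(type(s.to_numpy()[0]))    # <class 'list'> - the elements are untouched
print(type(np.asarray(s)[0]))   # <class 'list'> as well
print(type(s.array[0]))         # still <class 'list'>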
------------Edited after testing------------------------------------------------
I set up a small test to see what the difference in timing and efficiency would be:
import pandas as pd
import numpy as np
from datetime import datetime

def get_dataframe():
    col1 = ['String data'] * 10000
    col2 = [list(range(0, 5000))] * 10000
    d = {'Description': col1, 'Measures': col2}
    df = pd.DataFrame(d)
    return df

def old_looping(df):
    new_df = pd.DataFrame(columns=df.columns)
    starttime = datetime.now()
    for i in range(len(df)):
        new_df.loc[i, ['Description', 'Measures']] = [df.at[i, 'Description'], np.array(df.at[i, 'Measures'])]
    endtime = datetime.now()
    duration = endtime - starttime
    print('Looping', duration)

def series_transforms(df):
    starttime = datetime.now()
    df['NParray'] = pd.Series(list(np.array(list(np.array(df['Measures'])))))
    df.drop(['Measures'], axis=1, inplace=True)
    endtime = datetime.now()
    duration = endtime - starttime
    print('Transforms', duration)

def use_apply(df):
    starttime = datetime.now()
    df['Measures'] = df['Measures'].apply(np.array)
    endtime = datetime.now()
    duration = endtime - starttime
    print('Apply', duration)

def run_test(tests):
    for i in range(tests):
        construct_df = get_dataframe()
        old_looping(construct_df)
    for i in range(tests):
        construct_df = get_dataframe()
        series_transforms(construct_df)
    for i in range(tests):
        construct_df = get_dataframe()
        use_apply(construct_df)

run_test(5)
With 10,000 rows the results were:
Transforms 3.945816
Transforms 3.968821
Transforms 3.891866
Transforms 3.859437
Transforms 3.860590
Apply 4.218867
Apply 4.015742
Apply 4.046986
Apply 3.906360
Apply 3.890740
Looping 27.662418
Looping 27.814523
Looping 27.298895
Looping 27.565626
Looping 27.222970
Transforming through Series -> list -> np.array -> list -> Series is only negligibly faster than using apply. apply is definitely shorter code and arguably easier to understand.
Increasing the number of rows or the array length scales all of the timings up by roughly the same factor.
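For what it's worth, a plain list comprehension assigned back to the column does the same per-element conversion as apply and, in my understanding, should perform in the same ballpark; it's just another way to write the mass conversion:
# Equivalent in effect to df['Measures'].apply(np.array):
# build a list of ndarrays (one per row) and assign it back to the column.
df['Measures'] = [np.array(values) for values in df['Measures']]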