I have a PostgreSQL database with a table whose columns look like: date, character varying, character varying, integer[]. The integer[] column stores a list of values such as 1,2,3,4,5. I'm using pd.read_sql to read the data into a DataFrame.
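Roughly, the read step looks like this (the connection string, table, and column names here are placeholders, not my real schema); the psycopg2 driver hands integer[] values back as plain Python lists:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection and query -- psycopg2 returns integer[] columns
# as Python lists, so the 'measures' column arrives as a column of lists.
engine = create_engine('postgresql+psycopg2://user:password@localhost/mydb')
df = pd.read_sql('SELECT obs_date, description, measures FROM my_table', engine)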

So I have a DataFrame with a date column, several string columns, and then a column holding a list of integers.

The array values are regularly used in numpy arrays to do vector math.
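As a quick illustration of why the conversion matters: multiplying a plain list repeats it, while a numpy array does the element-wise math I need:

import numpy as np

measures = [1, 2, 3, 4, 5]
print(measures * 2)            # list repetition: [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
print(np.array(measures) * 2)  # element-wise math: [ 2  4  6  8 10]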

In the past I couldn't find a way to convert the list column to a numpy array column without looping and recreating the dataframe row by row. As an example:

import pandas as pd
import numpy as np

col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)

new_df = pd.DataFrame(columns=df.columns)

# Rebuild the frame row by row (slow): copy the string value across and
# wrap each list in np.array() as it goes.
for i in range(len(df)):
    new_df.loc[i, ['Description', 'Measures']] = [df.at[i, 'Description'], np.array(df.at[i, 'Measures'])]

print(new_df)

This looping could be over a few thousand rows.

More recently I figured out that I could do a single-line conversion of Series -> list -> np.array -> list -> Series and achieve the result a lot more efficiently.

import pandas as pd
import numpy as np

col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)

# One pass over the column: Series -> object array -> list of lists ->
# 2-D int array -> list of 1-D row arrays -> Series of numpy arrays.
df['NParray'] = pd.Series(list(np.array(list(np.array(df['Measures'])))))
df.drop(['Measures'], axis=1, inplace=True)

print(df)
print(type(df['NParray'][0]))

I read about and tried to use Series.array and Series.to_numpy, but they don't really achieve what I'm trying to do.
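To show what I mean, to_numpy() on a Series of lists just returns a 1-D object array whose elements are still plain Python lists, not a column of numpy arrays:

import pandas as pd

s = pd.Series([[1, 2, 3, 4, 5]] * 4)
out = s.to_numpy()
print(out.dtype)     # object
print(type(out[0]))  # <class 'list'>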

So, the question is: Is there a method to convert a pd.Series of lists to a numpy array as I'm trying to do? Is there any easier way to mass convert these lists to numpy arrays?

I was hoping for something simple like:

df['NParray'] = np.asarray(df['Measures'])
df['NParray'] = np.array(df['Measures'])
df['NParray'] = df['Measures'].array
df['NParray'] = df['Measures'].to_numpy()

But these have different functions and do not work for my purpose.

------------ Edited after testing ------------

I set up a small test to see what the difference in timings and efficiency would be:

import pandas as pd
import numpy as np
from datetime import datetime

def get_dataframe():
    col1 = ['String data'] * 10000
    col2 = [list(range(0,5000))] * 10000
    d = {'Description': col1, 'Measures':col2}
    df = pd.DataFrame(d)
    return(df)


def old_looping(df):
    # Rebuild the frame row by row, converting each list as it goes.
    new_df = pd.DataFrame(columns=df.columns)
    starttime = datetime.now()
    for i in range(len(df)):
        new_df.loc[i, ['Description', 'Measures']] = [df.at[i, 'Description'], np.array(df.at[i, 'Measures'])]
    endtime = datetime.now()
    duration = endtime - starttime
    print('Looping', duration)


def series_transforms(df):
    # Single-line Series -> list -> np.array -> list -> Series conversion.
    starttime = datetime.now()
    df['NParray'] = pd.Series(list(np.array(list(np.array(df['Measures'])))))
    df.drop(['Measures'], axis=1, inplace=True)
    endtime = datetime.now()
    duration = endtime - starttime
    print('Transforms', duration)


def use_apply(df):
    # Per-element conversion with Series.apply.
    starttime = datetime.now()
    df['Measures'] = df['Measures'].apply(np.array)
    endtime = datetime.now()
    duration = endtime - starttime
    print('Apply', duration)


def run_test(tests):
    for i in range(tests):
        construct_df = get_dataframe()
        old_looping(construct_df)
    for i in range(tests):
        construct_df = get_dataframe()
        series_transforms(construct_df)
    for i in range(tests):
        construct_df = get_dataframe()
        use_apply(construct_df)

run_test(5)

With 10,000 rows the results were:

Transforms 3.945816
Transforms 3.968821
Transforms 3.891866
Transforms 3.859437
Transforms 3.860590

Apply 4.218867
Apply 4.015742
Apply 4.046986
Apply 3.906360
Apply 3.890740

Looping 27.662418
Looping 27.814523
Looping 27.298895
Looping 27.565626
Looping 27.222970

Transforming through Series -> list -> np.array -> list -> Series is only negligibly faster than using apply. The apply version is definitely shorter code and arguably easier to understand.

Increasing the number of rows or the array length scales the timings roughly proportionally.
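(A note for anyone re-running this: the wall-clock subtraction above is fine for rough numbers, but the standard-library timeit module gives more stable timings. A sketch driving the same functions defined above:)

import timeit

# Best of three runs per strategy; each run builds a fresh DataFrame so
# the conversions never see already-converted data.
for func in (old_looping, series_transforms, use_apply):
    best = min(timeit.repeat(lambda: func(get_dataframe()), number=1, repeat=3))
    print(func.__name__, best)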


1 Answer


Easiest might be to go with apply to convert each list to an np.array: df['Measures'].apply(np.array)

Full example:

import pandas as pd
import numpy as np

col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
display(df.Measures)  # display() is available in Jupyter/IPython; use print() in a plain script

df['NParray'] = df['Measures'].apply(np.array)
df.drop(['Measures'], axis=1, inplace=True)

print(df)
print(type(df['NParray'][0]))
John Sloper
  • So easy. I completely forgot about trying to use apply(). With that, I don't even have to create the new column and then drop the old one, I can simply do df['Measures'] = df['Measures'].apply(np.array) Thanks for the quick and simple answer. Still learning. – GeorgeLPerkins Mar 05 '19 at 21:02
  • Actually, I just read that apply might not be a good idea. https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code Especially with pd.Series. I'll have to run some large volume tests and see if I need to stick to df['NParray'] = pd.Series(list(np.array(list(np.array(df['Measures']))))) – GeorgeLPerkins Mar 05 '19 at 21:27
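For reference, one vectorized alternative in the spirit of that linked discussion (a sketch, not from the answer; it assumes every list has the same length): build a single 2-D array from the whole column, then split it back into row arrays:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Description': ['String data'] * 4,
                   'Measures': [[1, 2, 3, 4, 5]] * 4})

# One 2-D int array from the equal-length lists, then a list of its
# 1-D rows -- no Python-level conversion call per row.
arr2d = np.array(df['Measures'].tolist())
df['NParray'] = list(arr2d)
print(type(df['NParray'][0]))  # <class 'numpy.ndarray'>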