
This is somewhat the reverse of what most people want to do when converting between lists and dataframes.

I am looking to convert a large dataframe (10M+ rows, 20+ columns) to a list of strings, where each entry is a string representation of one row of the dataframe. I can do this using pandas' to_csv() method, but I'm wondering if there is a faster way, as this is proving to be a bottleneck in my code.

Minimum working example:

import numpy as np
import pandas as pd

# Create the initial dataframe.
size = 10000000
cols = list('abcdefghijklmnopqrstuvwxyz')
df = pd.DataFrame()
for col in cols:
    df[col] = np.arange(size)
    df[col] = "%s_" % col + df[col].astype(str)

# Convert to the required list structure
ret_val = df.to_csv(index=False, header=False).split("\n")[:-1]

The conversion step in the above code takes roughly 90 seconds for a dataframe of 10,000,000 rows on a single thread of my Core i9, and is heavily CPU bound. I'd love to reduce that by an order of magnitude if at all possible.

EDIT: I'm not looking to save the data to a .csv or to a file. I'm just looking to convert the dataframe to an array of strings.

EDIT: Input/Output example with only 5 columns:

In  [1]: df.head(10)
Out [1]:    a       b       c       d       e
         0  a_0     b_0     c_0     d_0     e_0
         1  a_1     b_1     c_1     d_1     e_1
         2  a_2     b_2     c_2     d_2     e_2
         3  a_3     b_3     c_3     d_3     e_3
         4  a_4     b_4     c_4     d_4     e_4
         5  a_5     b_5     c_5     d_5     e_5
         6  a_6     b_6     c_6     d_6     e_6
         7  a_7     b_7     c_7     d_7     e_7
         8  a_8     b_8     c_8     d_8     e_8
         9  a_9     b_9     c_9     d_9     e_9

In  [2]: ret_val[:10]
Out [2]: ['a_0,b_0,c_0,d_0,e_0',
          'a_1,b_1,c_1,d_1,e_1',
          'a_2,b_2,c_2,d_2,e_2',
          'a_3,b_3,c_3,d_3,e_3',
          'a_4,b_4,c_4,d_4,e_4',
          'a_5,b_5,c_5,d_5,e_5',
          'a_6,b_6,c_6,d_6,e_6',
          'a_7,b_7,c_7,d_7,e_7',
          'a_8,b_8,c_8,d_8,e_8',
          'a_9,b_9,c_9,d_9,e_9']
MikeFenton
    why would you want to do that? I'd try my hardest to keep that much data mostly out of RAM, certainly parsed into appropriate data types so it can all be operated on efficiently – Sam Mason Sep 09 '19 at 17:56
  • Need it for entropy and information comparisons between two lists of strings. – MikeFenton Sep 10 '19 at 08:50

4 Answers


I get a ~2.5x speedup with multiprocessing...

import multiprocessing

# df from the OP's code above, available in global scope

def fn(i):
    return df[i:i+1000].to_csv(index=False, header=False).split('\n')[:-1]

with multiprocessing.Pool() as pool:
    result = []
    for a in pool.map(fn, range(0, len(df), 1000)):
        result.extend(a)

This reduces the overall time for 1M rows from 6.8 s to 2.8 s on my laptop, and would hopefully scale further with the extra cores of an i9 CPU.

This depends on Unix fork semantics to share the dataframe with child processes, and obviously does a bit more work, but might help...

Combining the numpy.savetxt suggestion from Massifox with multiprocessing takes this down to 2.0 seconds; just map the following function instead:

from io import StringIO

def fn2(i):
    # N is the chunk size (e.g. 1000, as above)
    with StringIO() as fd:
        np.savetxt(fd, df[i:i+N], fmt='%s', delimiter=',')
        return fd.getvalue().split('\n')[:-1]

The result is otherwise basically the same.
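For reference, a minimal sketch of driving fn2 with the same Pool pattern as above (assuming df is visible in the child processes and N is the chunk size; the value 1000 is only an example):

N = 1000  # chunk size, same role as the 1000 used above

with multiprocessing.Pool() as pool:
    result = []
    for chunk in pool.map(fn2, range(0, len(df), N)):
        result.extend(chunk)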

Your comment that says "the dataframe is a variable within a class" can be addressed in a variety of ways. A simple one is to pass the dataframe to the Pool initializer, at which point it won't be pickled (under Unix anyway), and stash a reference to it in a global variable. That reference can then be used by each worker process, e.g.:

def stash_df(df):
    global the_df
    the_df = df

def fn(i):
    with StringIO() as fd:
        np.savetxt(fd, the_df[i:i+N], fmt='%s', delimiter=',')
        return fd.getvalue().split('\n')[:-1]

with multiprocessing.Pool(initializer=stash_df, initargs=(df,)) as pool:
    result = []
    for a in pool.map(fn, range(0, len(df), N)):
        result.extend(a)

This will be fine as long as each Pool is used with a single dataframe.

Sam Mason
  • For some reason the multiprocessing implementation is far slower at scale. I suspect this is due to the piping of the large dataframe chunks between the parent and the different child processes. However, the `numpy.savetxt` method gives a near 2x speedup which is greatly appreciated! – MikeFenton Sep 10 '19 at 14:26
  • assuming you can rely on forking; the only appreciable IPC should be returning [pickled](https://docs.python.org/3/library/pickle.html) lists of strings from child to parent processes. you want to arrange things so that the data frame ends up available in the child processes after spawning the Pool (i.e. not passed as an argument to the `map`ped function) – Sam Mason Sep 11 '19 at 08:41
  • The problem is that the dataframe is a variable within a class. If the method to be parallelized is part of the class, it can see the dataframe instance but cannot be pickled. In order to pickle the method, it cannot be a class instance, and thus cannot see the dataframe without it being explicitly passed in to it... – MikeFenton Sep 11 '19 at 15:08

You can try different ways to speed up data writing to disk:

  1. Writing a compressed file can speed up writing by up to 10x:

    df.to_csv('output.csv.gz', header=True, index=False, chunksize=100000,
              compression='gzip', encoding='utf-8')
    Choose the chunksize that works best for you.

  2. Switch to the HDF5 format (note that to_hdf requires a key):

    df.to_hdf('output.h5', key='df', mode='w')

  3. Following krassowski's answer, use numpy directly. For example, with the following df:

    df = pd.DataFrame({'A': range(1000000)})
    df['B'] = df.A + 1.0
    df['C'] = df.A + 2.0
    df['D'] = df.A + 3.0

    Pandas to csv:

    df.to_csv('pandas_to_csv', index=False)
    On my computer, this takes 6.45 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

    Numpy to csv:

    np.savetxt('numpy_savetxt', df.values, fmt='%d,%.1f,%.1f,%.1f',
               header=','.join(df.columns), comments='')
    On my computer, this takes 3.38 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

  4. Using Pandaral·lel.
    It is a simple and efficient tool to parallelize your pandas computations across all your CPUs (Linux and macOS only), often by changing only one line of code (see the first sketch after this list).

  5. You can also think about replacing the pandas dataframe with a Dask dataframe; its CSV API is very similar to pandas (see the second sketch after this list).
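
For point 4, a minimal sketch assuming the pandarallel package is installed; the lambda and the comma-joining are only an illustration of building one string per row, not part of the original answer:

    import pandas as pd
    from pandarallel import pandarallel

    pandarallel.initialize()  # spawns one worker per core by default

    # Build one comma-joined string per row, in parallel across cores.
    lines = df.parallel_apply(lambda row: ','.join(row.astype(str)), axis=1).tolist()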
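
For point 5, a sketch with Dask that runs the same to_csv trick per partition; the npartitions value and the meta hint are assumptions for illustration:

    import dask.dataframe as dd

    ddf = dd.from_pandas(df, npartitions=8)  # assumed partition count, tune per machine

    # Convert each partition to CSV lines independently, then concatenate on compute().
    lines = ddf.map_partitions(
        lambda part: pd.Series(part.to_csv(index=False, header=False).split('\n')[:-1]),
        meta=('lines', 'object'),
    ).compute().tolist()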

Massifox
  • Hi, thanks for the input but I'm not looking to save data to file. I'm only looking to create a list of strings. – MikeFenton Sep 10 '19 at 08:44

Using dictionaries gives a slight improvement in performance:

size = 100000
cols = list('abcdefghijklmnopqrstuvwxyz')

Dict Version:

%%timeit
dict_res= {}
for col in cols:
    dict_res[col] = ["%s_%d" % (col, n) for n in np.arange(size)]
df2 = pd.DataFrame(dict_res)
# 1.56 s ± 99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Your example:

%%timeit
df = pd.DataFrame()
for col in cols:
    df[col] = np.arange(size)
    df[col] = "%s_" % col + df[col].astype(str)
# 1.91 s ± 84.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Multiprocessing version:

Using multiprocessing, the code would be the following:

import multiprocessing
import numpy as np
import pandas as pd

size = 100000
cols = list('abcdefghijklmnopqrstuvwxyz')
n_core = multiprocessing.cpu_count()

def format_col(col):
    return col, ["%s_%d" % (col, n) for n in np.arange(size)]

with multiprocessing.Pool(n_core) as pool:
    result = {}
    for col, values in pool.map(format_col, cols):
        result[col] = values
df = pd.DataFrame(result)

I can't run it on my PC right now, but performance will probably improve.

Massifox
  • This is speeding up the generation of the dataframe, which is not the question/problem. I need to speed up the conversion of the dataframe to a list of strings. – MikeFenton Sep 10 '19 at 12:04

Try this solution:

    list_of_string = df.head(5).set_index(cols[0]).to_string(header=False).split('\n')[1:]

    # output:
    ['a_0  b_1  c_1  d_1  e_1  f_1  g_1  h_1  i_1  j_1  k_1  l_1  m_1  n_1  o_1  p_1  q_1  r_1  s_1  t_1  u_1  v_1  w_1  x_1  y_1  z_1',
     'a_1  b_2  c_2  d_2  e_2  f_2  g_2  h_2  i_2  j_2  k_2  l_2  m_2  n_2  o_2  p_2  q_2  r_2  s_2  t_2  u_2  v_2  w_2  x_2  y_2  z_2',
     'a_2  b_3  c_3  d_3  e_3  f_3  g_3  h_3  i_3  j_3  k_3  l_3  m_3  n_3  o_3  p_3  q_3  r_3  s_3  t_3  u_3  v_3  w_3  x_3  y_3  z_3',
     'a_3  b_4  c_4  d_4  e_4  f_4  g_4  h_4  i_4  j_4  k_4  l_4  m_4  n_4  o_4  p_4  q_4  r_4  s_4  t_4  u_4  v_4  w_4  x_4  y_4  z_4',
     'a_4  b_5  c_5  d_5  e_5  f_5  g_5  h_5  i_5  j_5  k_5  l_5  m_5  n_5  o_5  p_5  q_5  r_5  s_5  t_5  u_5  v_5  w_5  x_5  y_5  z_5']

If you want to replace the whitespace with commas:

[s.replace('  ', ',') for s in list_of_string]
# output:
['a_0,b_1,c_1,d_1,e_1,f_1,g_1,h_1,i_1,j_1,k_1,l_1,m_1,n_1,o_1,p_1,q_1,r_1,s_1,t_1,u_1,v_1,w_1,x_1,y_1,z_1',
 'a_1,b_2,c_2,d_2,e_2,f_2,g_2,h_2,i_2,j_2,k_2,l_2,m_2,n_2,o_2,p_2,q_2,r_2,s_2,t_2,u_2,v_2,w_2,x_2,y_2,z_2',
 'a_2,b_3,c_3,d_3,e_3,f_3,g_3,h_3,i_3,j_3,k_3,l_3,m_3,n_3,o_3,p_3,q_3,r_3,s_3,t_3,u_3,v_3,w_3,x_3,y_3,z_3',
 'a_3,b_4,c_4,d_4,e_4,f_4,g_4,h_4,i_4,j_4,k_4,l_4,m_4,n_4,o_4,p_4,q_4,r_4,s_4,t_4,u_4,v_4,w_4,x_4,y_4,z_4',
 'a_4,b_5,c_5,d_5,e_5,f_5,g_5,h_5,i_5,j_5,k_5,l_5,m_5,n_5,o_5,p_5,q_5,r_5,s_5,t_5,u_5,v_5,w_5,x_5,y_5,z_5']

You can speed up this code with the advice I gave you in previous answers.

Tips: Dask, Pandaral·lel and multiprocessing are your friends!

Massifox