This is somewhat the reverse of what most people want to do when converting between lists and dataframes.
I am looking to convert a large dataframe (10M+ rows, 20+ columns) to a list of strings, where each entry is a string representation of one row of the dataframe. I can do this using pandas' to_csv()
method, but I'm wondering if there is a faster way, as this step is proving to be a bottleneck in my code.
Minimal working example:
import numpy as np
import pandas as pd
# Create the initial dataframe: one string column per letter, 10M rows each.
size = 10000000
cols = list('abcdefghijklmnopqrstuvwxyz')
df = pd.DataFrame()
for col in cols:
    df[col] = np.arange(size)
    df[col] = "%s_" % col + df[col].astype(str)
# Convert to the required list structure
ret_val = df.to_csv(index=False, header=False).split("\n")[:-1]
The conversion step of the above code takes ~90 seconds for a dataframe of 10,000,000 rows on a single thread of my Core i9, and is heavily CPU-bound. I'd love to reduce that by an order of magnitude if at all possible.
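For reference, here is roughly how I'm timing just the conversion line (a minimal sketch; exact numbers will of course vary with machine and pandas version):
import time

start = time.perf_counter()
# Time only the dataframe-to-list-of-strings conversion, not the dataframe construction.
ret_val = df.to_csv(index=False, header=False).split("\n")[:-1]
print("to_csv conversion took %.1f s" % (time.perf_counter() - start))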
EDIT: I'm not looking to write the data to a .csv or to any other file. I'm just looking to convert the dataframe to a list of strings.
EDIT: Input/Output example with only 5 columns:
In [1]: df.head(10)
Out[1]:
     a    b    c    d    e
0  a_0  b_0  c_0  d_0  e_0
1  a_1  b_1  c_1  d_1  e_1
2  a_2  b_2  c_2  d_2  e_2
3  a_3  b_3  c_3  d_3  e_3
4  a_4  b_4  c_4  d_4  e_4
5  a_5  b_5  c_5  d_5  e_5
6  a_6  b_6  c_6  d_6  e_6
7  a_7  b_7  c_7  d_7  e_7
8  a_8  b_8  c_8  d_8  e_8
9  a_9  b_9  c_9  d_9  e_9
In [2]: ret_val[:10]
Out[2]:
['a_0,b_0,c_0,d_0,e_0',
 'a_1,b_1,c_1,d_1,e_1',
 'a_2,b_2,c_2,d_2,e_2',
 'a_3,b_3,c_3,d_3,e_3',
 'a_4,b_4,c_4,d_4,e_4',
 'a_5,b_5,c_5,d_5,e_5',
 'a_6,b_6,c_6,d_6,e_6',
 'a_7,b_7,c_7,d_7,e_7',
 'a_8,b_8,c_8,d_8,e_8',
 'a_9,b_9,c_9,d_9,e_9']
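For what it's worth, the one alternative I've sketched so far is plain column-wise string concatenation, shown below. It assumes every column already holds strings (as in the example above), and I haven't confirmed whether it actually beats to_csv(), which is partly why I'm asking.
# Column-wise concatenation sketch: build each comma-joined row string
# directly from the string columns, skipping the CSV writer entirely.
# Assumes all columns are already string/object dtype.
joined = df[cols[0]]
for col in cols[1:]:
    joined = joined + "," + df[col]
ret_val = joined.tolist()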