
I tried to concatenate a large number of columns containing integers into one string.

Basically, starting from:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4], 'a': [0, 1, 2, 3], 'b': [4, 5, 6, 7], 'c': [8, 9, 0, 1]})

To obtain (note: this era of pandas sorts the dict keys, so the columns come out as a, b, c, id, and the join glues b, c and id together):

id join
0   1  481
1   2  592
2   3  603
3   4  714

I found several methods to do this (here and here):

Method 1:

conc = df.copy()  # assumption: conc starts as a copy of df (its creation isn't shown in the post)
conc['glued'] = ''
i = 1
while i < len(df.columns):
    conc['glued'] = conc['glued'] + df[df.columns[i]].values.astype(str)
    i = i + 1

This method works, but it is a bit slow (45 minutes on my "test" case of 18,000 rows x 40,000 columns). I am concerned by the loop over the columns, as this program must eventually be applied to tables of 600,000 columns, and I am afraid it will take too long.

Method 2a:

conc['join']=[''.join(row) for row in df[df.columns[1:]].values.astype(str)]

Method 2b:

conc['apply'] = df[df.columns[1:]].apply(lambda x: ''.join(x.astype(str)), axis=1)

Both of these methods are 10 times more efficient than the previous one; they iterate over rows, which is good, and they work perfectly on my "debug" table df. But when I apply either of them to my "test" table of 18k x 40k, it leads to a MemoryError (I have 60% of my 32 GB of RAM occupied after reading the corresponding csv file). I can copy my DataFrame without exceeding memory but, curiously, applying either method makes the code crash.
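
A likely explanation for the crash (a back-of-the-envelope sketch, assuming the integers are int64): NumPy's astype(str) allocates a fixed-width '<U21' unicode array (wide enough for any int64, at 4 bytes per character), so the intermediate string matrix is roughly ten times larger than the integers themselves:

import numpy as np

# rough estimate with the shapes of the "test" case
n_rows, n_cols = 18_000, 40_000
int_bytes = n_rows * n_cols * 8        # int64 input: ~5.8 GB
str_bytes = n_rows * n_cols * 21 * 4   # '<U21' string copy: ~60 GB
print(f"int64 input: {int_bytes / 1e9:.1f} GB")
print(f"string copy: {str_bytes / 1e9:.1f} GB")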

Do you see how I can fix and improve this code to use an efficient row-based iteration? Thank you!


Appendix: Here is the code I use on my test case:

geno_reader = pd.read_csv(genotype_file, header=0, compression='gzip', usecols=geno_columns_names)
fimpute_geno = pd.DataFrame({'SampID': geno_reader['SampID']})

I should probably use the chunksize option to read this file, but I haven't yet really understood how to use the chunks after reading them; a sketch follows.
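
For what it's worth, here is a minimal sketch of what chunked reading could look like in this case; the chunk size of 1,000 rows is an arbitrary guess, and the per-chunk join mirrors Method 2 below:

pieces = []
reader = pd.read_csv(genotype_file, header=0, compression='gzip',
                     usecols=geno_columns_names, chunksize=1000)
for chunk in reader:
    # build each chunk's Calls column, then stitch the chunks back together
    calls = chunk[chunk.columns[1:]].astype(int).astype(str).apply(''.join, axis=1)
    pieces.append(pd.DataFrame({'SampID': chunk['SampID'], 'Calls': calls}))
fimpute_geno = pd.concat(pieces, ignore_index=True)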

Method 1:

fimpute_geno['Calls'] = ''
for i in range(1, len(geno_reader.columns)):
    fimpute_geno['Calls'] = (fimpute_geno['Calls']
                             + geno_reader[geno_reader.columns[i]].values.astype(int).astype(str))

This works, in 45 minutes. There are some quite ugly bits of code, like the .astype(int).astype(str). I don't know why Python doesn't recognize my integers and considers them floats.

Method 2:

fimpute_geno['Calls'] = geno_reader[geno_reader.columns[1:]]\
                    .apply(lambda x: ''.join(x.astype(int).astype(str)), axis=1)

This leads to a MemoryError.

  • You are doing some odd things: `fimpute_geno = pd.DataFrame({'SampID': geno_reader['SampID']})`: why create a new df? Could you not just do `fimpute_geno = geno_reader['SampID']`? Or do you need a copy? The reason that the dtype is a float is probably because you have missing data (`NaN`), which cannot be represented as an int; you need to decide whether to replace these values or drop the rows. Also, what are you trying to achieve? It looks like you are trying to build a string representation of your data – EdChum Feb 15 '15 at 21:26
  • I don't use `fimpute_geno = geno_reader['SampID']` because that would just give me a Series, and I need a DataFrame to hold other information. I also suspected some NaN, but it seems there aren't any. – Romn Feb 16 '15 at 01:50

1 Answer


Here's something to try. It would require that you convert your columns to strings, though. Your sample frame:

    a   b   c   id
0   0   4   8   1
1   1   5   9   2
2   2   6   0   3
3   3   7   1   4

then

import numpy as np

# you could also do conc[['b','c','id']] for the next two lines
conc.ix[:, 'b':'id'] = conc.ix[:, 'b':'id'].astype('str')
conc['join'] = np.sum(conc.ix[:, 'b':'id'], axis=1)

Would give

    a   b   c   id  join
0   0   4   8   1   481
1   1   5   9   2   592
2   2   6   0   3   603
3   3   7   1   4   714
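
As a side note, `.ix` has since been removed from pandas; a minimal equivalent sketch with label-based `.loc` slicing:

cols = conc.loc[:, 'b':'id'].columns   # same label slice as .ix above
conc[cols] = conc[cols].astype(str)    # whole-column assignment, so dtypes change cleanly
conc['join'] = conc[cols].sum(axis=1)  # row-wise string concatenation
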
Bob Haffner
  • Your proposition made me understand something: it is not the concatenation that takes time, it is the conversion to string. When I do `geno_test.ix[:,geno_test.columns[1:]] = geno_test.ix[:,geno_test.columns[1:]].astype('str')` (with a limited 20 rows x 40,000 columns table), it takes about a dozen minutes to complete. Then, when I do `fimpute_geno['Calls'] = np.sum(geno_test.ix[:,geno_test.columns[1:]],axis=1)`, it is quasi-instantaneous. – Romn Feb 16 '15 at 02:48
  • Yeah, that's a lot of columns to convert. I'm guessing there's a lot of copying going on when dtypes are changed. Maybe you should modify your question or post a new one about the data types part – Bob Haffner Feb 16 '15 at 13:10
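
Following up on that copying point, a hedged sketch (using the toy column names from the example above) of converting one column at a time, so that only one column-sized temporary copy exists at any moment:

# convert column by column: the temporary holds one column, not the whole frame
for col in ['b', 'c', 'id']:
    conc[col] = conc[col].astype(str)
conc['join'] = conc[['b', 'c', 'id']].sum(axis=1)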