
I am looking for the most memory-efficient way to concatenate an int32 and a datetime64 column to create a third column. I have two columns in a DataFrame, an int32 and a datetime64, and I want to create a third column that combines the two into a single string key.

The dataframe looks like this:

(image: sample rows showing the cust_id and tran_dt columns)

What I want is:

(image: the same rows with an added comb_key column combining cust_id and tran_dt)

I have created a test data frame as follows:

import pandas as pd
import numpy as np
import sys
import datetime as dt
%load_ext memory_profiler
np.random.seed(42)
df_rows = 10**6
todays_date = dt.datetime.now().date()
# two years of daily dates ending today
dt_array = pd.date_range(todays_date - dt.timedelta(2*365), periods=2*365, freq='D')
# pool of 100,000 six-digit customer ids
cust_id_array = np.random.randint(100000, 999999, size=(100000, 1))
df = pd.DataFrame({'cust_id': np.random.choice(cust_id_array.flatten(), df_rows, replace=True),
                   'tran_dt': np.random.choice(dt_array, df_rows, replace=True)})
df.info()

The dataframe statistics before concatenation are: (image: df.info() output; the frame uses about 16 MB)

I have used map, astype, and apply to concatenate, but the memory usage is still quite high:

%memit -r 1 df['comb_key'] = df["cust_id"].map(str) + '----' + df["tran_dt"].map(str)

%memit -r 1 df['comb_key'] = df["cust_id"].astype(str) + '----' + df["tran_dt"].astype(str)

%memit -r 1 df['comb_key'] = df.apply(lambda x: str(x['cust_id']) \
    + '----' + dt.datetime.strftime(x['tran_dt'], '%Y-%m-%d'), axis=1)

The memory usage for the three approaches is: (image: %memit output for the three methods)

Is there a more memory efficient way of doing this? My real life data sets are about 1.8 GB's uncompressed on a machine with 16GB RAM.

Partha Sarkar
  • I'd favour the middle one, you could also try `df['cust_id'] + '----' + df['tran_dt'].dt.strftime('%Y-%m-%d')` but I wouldn't expect much difference – EdChum May 12 '17 at 09:31
  • The most memory efficient way would to be *not* to store it. Is that a firm requirement - or can it be generated on the fly when needed? Or - can it be generated pre-dataframe creation and the other columns dropped etc...? If you need all of them to exist simultaneously there isn't a memory efficient way of doing so. – Jon Clements May 12 '17 at 10:04
  • [Related question](http://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python) – MaxU - stand with Ukraine May 12 '17 at 10:11

1 Answer

df['comb_key'] = df["cust_id"].astype(str) + '----' + df["tran_dt"].astype(str)

is computationally the fastest method, as you're effectively only performing one typecast for each element of data, and pretty much all of this takes place in C.
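
If you only need the date part of the timestamp, the variant EdChum suggests in the comments yields shorter strings and therefore a smaller result column. A minimal sketch (the '----' separator and the date format are taken from the question; note that cust_id still needs an explicit cast to str before string concatenation):

df['comb_key'] = df['cust_id'].astype(str) + '----' + df['tran_dt'].dt.strftime('%Y-%m-%d')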

So if you're running into memory issues, you'll have to do this in sections, for example in two chunks:

%%memit
df['comb_key'] = ''
df.comb_key.update(df["cust_id"].iloc[:500000].astype(str) + '----' + df["tran_dt"].iloc[:500000].astype(str))
df.comb_key.update(df["cust_id"].iloc[500000:].astype(str) + '----' + df["tran_dt"].iloc[500000:].astype(str))

# peak memory: 253.06 MiB, increment: 63.25 MiB

Note that the new column consumes 65MB of memory:

df.memory_usage(deep=True)

# Index             72
# cust_id      8000000
# tran_dt      8000000
# comb_key    65000000
# dtype: int64

So make sure you have enough memory to store the result in the first place. Note, though, that if this operation barely fits in memory, you probably won't have enough headroom left to do further work on the dataframe either.
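
As a rough sketch, the two-chunk approach generalizes to any number of chunks. The helper name and the n_chunks value below are illustrative, not part of the original answer; like the answer's version, it relies on Series.update aligning on a unique index:

import numpy as np

def add_comb_key_chunked(df, n_chunks=4):
    # pre-allocate the column, then fill it chunk by chunk so that only one
    # chunk's worth of temporary string Series is alive at any time
    df['comb_key'] = ''
    chunk = int(np.ceil(len(df) / n_chunks))
    for start in range(0, len(df), chunk):
        part = slice(start, start + chunk)
        df['comb_key'].update(df['cust_id'].iloc[part].astype(str)
                              + '----'
                              + df['tran_dt'].iloc[part].astype(str))

add_comb_key_chunked(df, n_chunks=4)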

Ken Wei
  • Thanks. So the data set after the operation grows from 16 MB to about 80 MB. But the issue seems to be peak memory usage, which is about 20 times the size of the original data set. My original data set is about 1.6 GB. So essentially I cannot do this operation without a machine with 32 GB of RAM? – Partha Sarkar May 12 '17 at 13:21
  • The smaller example grows the dataframe to about 5x its size, so I would roughly guess that your original data set would expand to take up about 5 × 1.6 = 8 GB of memory. Assuming this is correct, you could still manage with a computer with 8 GB of RAM, just that your system will slow down as it starts to utilize the swap space. – Ken Wei May 12 '17 at 16:27