
I am looking for the most memory-efficient way to concatenate an int32 and a datetime64 column to create a third column. I have two columns in a DataFrame, an int32 and a datetime64, and I want to create a third column that combines the two into a single string key.

The dataframe looks like this:

(image: sample rows showing the cust_id and tran_dt columns)

What I want is:

(image: the same rows with an added comb_key column combining cust_id and tran_dt)

I have created a test data frame as follows:

import pandas as pd
import numpy as np
import sys
import datetime as dt
%load_ext memory_profiler
np.random.seed(42)
df_rows = 10**6
todays_date = dt.datetime.now().date()
# two years of daily dates ending today
dt_array = pd.date_range(todays_date - dt.timedelta(2*365), periods=2*365, freq='D')
# pool of 100,000 six-digit customer ids
cust_id_array = np.random.randint(100000, 999999, size=(100000, 1))
df = pd.DataFrame({'cust_id': np.random.choice(cust_id_array.flatten(), df_rows, replace=True),
                   'tran_dt': np.random.choice(dt_array, df_rows, replace=True)})
df.info()

The dataframe statistics before concatenation are: (image: df.info() output; the frame uses about 16 MB)

I have used map, astype, and apply to concatenate, but the memory usage is still quite high:

%memit -r 1 df['comb_key'] = df["cust_id"].map(str) + '----' + df["tran_dt"].map(str)

%memit -r 1 df['comb_key'] = df["cust_id"].astype(str) + '----' + df["tran_dt"].astype(str)

%memit -r 1 df['comb_key'] = df.apply(lambda x: str(x['cust_id']) \
    + '----' + dt.datetime.strftime(x['tran_dt'], '%Y-%m-%d'), axis=1)

The memory usage for the three approaches is: (image: %memit output for the three methods)

Is there a more memory efficient way of doing this? My real life data sets are about 1.8 GB's uncompressed on a machine with 16GB RAM.

Partha Sarkar
  • I'd favour the middle one, you could also try `df['cust_id'] + '----' + df['tran_dt'].dt.strftime('%Y-%m-%d')` but I wouldn't expect much difference – EdChum May 12 '17 at 09:31
  • The most memory efficient way would to be *not* to store it. Is that a firm requirement - or can it be generated on the fly when needed? Or - can it be generated pre-dataframe creation and the other columns dropped etc...? If you need all of them to exist simultaneously there isn't a memory efficient way of doing so. – Jon Clements May 12 '17 at 10:04
  • [Related question](http://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python) – MaxU - stand with Ukraine May 12 '17 at 10:11

1 Answer

df['comb_key'] = df["cust_id"].astype(str) + '----' + df["tran_dt"].astype(str)

is computationally the fastest method, as you're effectively only performing one typecast for each element of data, and pretty much all of this takes place in C.
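
If you only need the date part of the timestamp, the variant EdChum suggests in the comments yields shorter strings and therefore a smaller result column. A minimal sketch (the '----' separator and the date format are taken from the question; note that cust_id still needs an explicit cast to str before string concatenation):

df['comb_key'] = df['cust_id'].astype(str) + '----' + df['tran_dt'].dt.strftime('%Y-%m-%d')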

So if you're running into memory issues, you'll have to do this in sections, for example in two chunks:

%%memit
df['comb_key'] = ''
df.comb_key.update(df["cust_id"].iloc[:500000].astype(str) + '----' + df["tran_dt"].iloc[:500000].astype(str))
df.comb_key.update(df["cust_id"].iloc[500000:].astype(str) + '----' + df["tran_dt"].iloc[500000:].astype(str))

# peak memory: 253.06 MiB, increment: 63.25 MiB

Note that the new column consumes 65MB of memory:

df.memory_usage(deep=True)

# Index             72
# cust_id      8000000
# tran_dt      8000000
# comb_key    65000000
# dtype: int64

So make sure you have enough memory to store the result in the first place. Note, though, that if this operation barely fits in memory, you probably won't have enough headroom left to do further work on the dataframe either.
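
As a rough sketch, the two-chunk approach generalizes to any number of chunks. The helper name and the n_chunks value below are illustrative, not part of the original answer; like the answer's version, it relies on Series.update aligning on a unique index:

import numpy as np

def add_comb_key_chunked(df, n_chunks=4):
    # pre-allocate the column, then fill it chunk by chunk so that only one
    # chunk's worth of temporary string Series is alive at any time
    df['comb_key'] = ''
    chunk = int(np.ceil(len(df) / n_chunks))
    for start in range(0, len(df), chunk):
        part = slice(start, start + chunk)
        df['comb_key'].update(df['cust_id'].iloc[part].astype(str)
                              + '----'
                              + df['tran_dt'].iloc[part].astype(str))

add_comb_key_chunked(df, n_chunks=4)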

Ken Wei
  • Thanks. So the data set after the operation grows from 16 MB to about 80 MB. But the issue seems to be peak memory usage, which is about 20 times the size of the original data set. My original data set is about 1.6 GB. So essentially I cannot do this operation without a machine with 32 GB of RAM? – Partha Sarkar May 12 '17 at 13:21
  • The smaller example grows the dataframe to about 5x its size, so I would roughly guess that your original data set would expand to take up about 5 × 1.6 = 8 GB of memory. Assuming this is correct, you could still manage with a computer with 8 GB of RAM, just that your system will slow down as it starts to utilize the swap space. – Ken Wei May 12 '17 at 16:27