How to efficiently combine two columns into one column/ combine strings?

Question

I have two columns(A and Date) like below, and need to combine them into one column like column C. This dataset has more than 900,000 rows.

Then I met with two main problems.

The data type of column "Date" is timestamp, when I combine them with string type will cause error:

TypeError: unsupported operand type(s) for +: 'Timestamp' and 'str'.

The code is way too time-costing. I wrote a for loop to do the combination as below:

for i in range(0,911462): df['Combine'][i]=df['Date'][i]+df['A'][i]

I guess it is because using for-loop is doing the combination row by row, thus every single combination cost a lot of time on system IO.

Is there any method to do this job more efficiently?

Sorry I'm not familiar with the photo adding process, please click the hyperlink "enter image description here" to see the data description. — Vane Leung, Jun 21 '17 at 04:38

score 4 · Accepted Answer · answered Jun 21 '17 at 04:48

You have to explicitly case the Timestamp to a string e.g. with strftime:

In [11]: df = pd.DataFrame([[pd.Timestamp("2017-01-01"), 'a'], [pd.Timestamp("2017-01-02"), 'b']], columns=["A", "B"])

In [12]: df["A"].dt.strftime("%Y-%m-%d") + df["B"]
Out[12]:
0    2017-01-01a
1    2017-01-02b
dtype: object

Tiny.D · Answer 2 · 2017-06-21T04:57:59.660

3

Try with astype, it can cast object like Timestamp to string:

import pandas as pd
df = pd.DataFrame({'A':['XX','YY','ZZ','AA'], 'Date':[pd.Timestamp("2016-01-01"),pd.Timestamp('2016-01-15'),pd.Timestamp('2016-12-01'),pd.Timestamp('2016-07-12')]})
df['Combine'] = df['Date'].astype(str) + '_'+df['A']
df

df will be:

    A   Date        Combine
0   XX  2016-01-01  2016-01-01_XX
1   YY  2016-01-15  2016-01-15_YY
2   ZZ  2016-12-01  2016-12-01_ZZ
3   AA  2016-07-12  2016-07-12_AA

edited Jun 21 '17 at 04:57

answered Jun 21 '17 at 04:43

Tiny.D

6,466
2
15
20

@piRSquared well, i think OP just simply want to combine the `A` column (str) and `Date` column (Timestamp). – Tiny.D Jun 21 '17 at 05:31

piRSquared · Answer 3 · 2017-06-21T05:20:53.110

Setup

df = pd.DataFrame(dict(
        A='XX YY ZZ AA'.split(),
        Date=pd.date_range('2017-03-31', periods=4)
    ))

Option 1
apply with a lambda based on format and dictionary unpacking.
This is a slow, but cool way to do it.

df.assign(C=df.apply(lambda x: '{Date:%Y-%m-%d}_{A}'.format(**x), 1))

    A       Date              C
0  XX 2017-03-31  2017-03-31_XX
1  YY 2017-04-01  2017-04-01_YY
2  ZZ 2017-04-02  2017-04-02_ZZ
3  AA 2017-04-03  2017-04-03_AA

Option 2
numpy.core.defchararray.add
Very fast way to do it using 'datetime64[D]' to round to the day.

chr_add = np.core.defchararray.add

d = df.Date.values.astype('datetime64[D]').astype(str)
a = df.A.values.astype(str)
df.assign(C=chr_add(chr_add(d, '_'), a))

    A       Date              C
0  XX 2017-03-31  2017-03-31_XX
1  YY 2017-04-01  2017-04-01_YY
2  ZZ 2017-04-02  2017-04-02_ZZ
3  AA 2017-04-03  2017-04-03_AA

Option 3
Rip-off of @AndyHayden's answer with a small twist. I'll add my underscore '_' in the strftime... Mainly, this is what I'll use in timeit.

df.assign(C=df.Date.dt.strftime('%Y-%m-%d_') + df.A)

    A       Date              C
0  XX 2017-03-31  2017-03-31_XX
1  YY 2017-04-01  2017-04-01_YY
2  ZZ 2017-04-02  2017-04-02_ZZ
3  AA 2017-04-03  2017-04-03_AA

Timing

%%timeit
chr_add = np.core.defchararray.add

d = df.Date.values.astype('datetime64[D]').astype(str)
a = df.A.values.astype(str)
chr_add(chr_add(d, '_'), a)

%timeit df.assign(C=df.apply(lambda x: '{Date:%Y-%m-%d}_{A}'.format(**x), 1))
%timeit df.assign(C=df.Date.dt.strftime('%Y-%m-%d_') + df.A)

small data

10000 loops, best of 3: 53.2 µs per loop
1000 loops, best of 3: 1.14 ms per loop
1000 loops, best of 3: 831 µs per loop

large data

df = pd.concat([df] * 10000, ignore_index=True)

10 loops, best of 3: 80.3 ms per loop
1 loop, best of 3: 4.58 s per loop
1 loop, best of 3: 233 ms per loop

score 0 · Answer 4 · answered Jun 21 '17 at 04:52

About 1., you can print the timestamp as a string

About 2. If you are planning to run this on a regular basis, you should consider to use map/reduce. MrJob is a tool written in python that allows you to run locally you map/reduce jobs, splitting them into multiple jobs that run in parallel. Check the examples, your script should be very simple. Important note: this works only if your are not worry about the rows order and is useful only if you have more than one core available.

Best.

How to efficiently combine two columns into one column/ combine strings?

4 Answers4