4

I have two columns(A and Date) like below, and need to combine them into one column like column C. This dataset has more than 900,000 rows.

enter image description here

Then I met with two main problems.

  1. The data type of column "Date" is timestamp, when I combine them with string type will cause error:

TypeError: unsupported operand type(s) for +: 'Timestamp' and 'str'.

  1. The code is way too time-costing. I wrote a for loop to do the combination as below:

    for i in range(0,911462): df['Combine'][i]=df['Date'][i]+df['A'][i]

I guess it is because using for-loop is doing the combination row by row, thus every single combination cost a lot of time on system IO.

Is there any method to do this job more efficiently?

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
Vane Leung
  • 65
  • 4
  • Sorry I'm not familiar with the photo adding process, please click the hyperlink "enter image description here" to see the data description. – Vane Leung Jun 21 '17 at 04:38

4 Answers4

4

You have to explicitly case the Timestamp to a string e.g. with strftime:

In [11]: df = pd.DataFrame([[pd.Timestamp("2017-01-01"), 'a'], [pd.Timestamp("2017-01-02"), 'b']], columns=["A", "B"])

In [12]: df["A"].dt.strftime("%Y-%m-%d") + df["B"]
Out[12]:
0    2017-01-01a
1    2017-01-02b
dtype: object
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
3

Try with astype, it can cast object like Timestamp to string:

import pandas as pd
df = pd.DataFrame({'A':['XX','YY','ZZ','AA'], 'Date':[pd.Timestamp("2016-01-01"),pd.Timestamp('2016-01-15'),pd.Timestamp('2016-12-01'),pd.Timestamp('2016-07-12')]})
df['Combine'] = df['Date'].astype(str) + '_'+df['A']
df

df will be:

    A   Date        Combine
0   XX  2016-01-01  2016-01-01_XX
1   YY  2016-01-15  2016-01-15_YY
2   ZZ  2016-12-01  2016-12-01_ZZ
3   AA  2016-07-12  2016-07-12_AA
Tiny.D
  • 6,466
  • 2
  • 15
  • 20
  • @piRSquared well, i think OP just simply want to combine the `A` column (str) and `Date` column (Timestamp). – Tiny.D Jun 21 '17 at 05:31
2

Setup

df = pd.DataFrame(dict(
        A='XX YY ZZ AA'.split(),
        Date=pd.date_range('2017-03-31', periods=4)
    ))

Option 1
apply with a lambda based on format and dictionary unpacking.
This is a slow, but cool way to do it.

df.assign(C=df.apply(lambda x: '{Date:%Y-%m-%d}_{A}'.format(**x), 1))

    A       Date              C
0  XX 2017-03-31  2017-03-31_XX
1  YY 2017-04-01  2017-04-01_YY
2  ZZ 2017-04-02  2017-04-02_ZZ
3  AA 2017-04-03  2017-04-03_AA

Option 2
numpy.core.defchararray.add
Very fast way to do it using 'datetime64[D]' to round to the day.

chr_add = np.core.defchararray.add

d = df.Date.values.astype('datetime64[D]').astype(str)
a = df.A.values.astype(str)
df.assign(C=chr_add(chr_add(d, '_'), a))

    A       Date              C
0  XX 2017-03-31  2017-03-31_XX
1  YY 2017-04-01  2017-04-01_YY
2  ZZ 2017-04-02  2017-04-02_ZZ
3  AA 2017-04-03  2017-04-03_AA

Option 3
Rip-off of @AndyHayden's answer with a small twist. I'll add my underscore '_' in the strftime... Mainly, this is what I'll use in timeit.

df.assign(C=df.Date.dt.strftime('%Y-%m-%d_') + df.A)

    A       Date              C
0  XX 2017-03-31  2017-03-31_XX
1  YY 2017-04-01  2017-04-01_YY
2  ZZ 2017-04-02  2017-04-02_ZZ
3  AA 2017-04-03  2017-04-03_AA

Timing

%%timeit
chr_add = np.core.defchararray.add

d = df.Date.values.astype('datetime64[D]').astype(str)
a = df.A.values.astype(str)
chr_add(chr_add(d, '_'), a)

%timeit df.assign(C=df.apply(lambda x: '{Date:%Y-%m-%d}_{A}'.format(**x), 1))
%timeit df.assign(C=df.Date.dt.strftime('%Y-%m-%d_') + df.A)

small data

10000 loops, best of 3: 53.2 µs per loop
1000 loops, best of 3: 1.14 ms per loop
1000 loops, best of 3: 831 µs per loop

large data

df = pd.concat([df] * 10000, ignore_index=True)

10 loops, best of 3: 80.3 ms per loop
1 loop, best of 3: 4.58 s per loop
1 loop, best of 3: 233 ms per loop
piRSquared
  • 285,575
  • 57
  • 475
  • 624
0

About 1., you can print the timestamp as a string

About 2. If you are planning to run this on a regular basis, you should consider to use map/reduce. MrJob is a tool written in python that allows you to run locally you map/reduce jobs, splitting them into multiple jobs that run in parallel. Check the examples, your script should be very simple. Important note: this works only if your are not worry about the rows order and is useful only if you have more than one core available.

Best.

afaundez
  • 73
  • 1
  • 6