
Hi, my dataframe looks like:

Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264

Simply, I need to add another column called '_id' as the concatenation of Store, Dept, and Date, like "1_1_2010-02-05". I assumed I could do it with df['id'] = df['Store'] + '_' + df['Dept'] + '_' + df['Date'], but it turned out not to work.

Similarly, I also need to add a new column as the log of Sales. I tried df['logSales'] = math.log(df['Sales']), but again it did not work.

dcc

2 Answers


You can first convert the integer columns to strings before concatenating with +:

In [25]: df['id'] = df['Store'].astype(str) +'_' +df['Dept'].astype(str) +'_'+df['Date']

In [26]: df
Out[26]: 
   Store  Dept        Date  Sales              id
0      1     1  2010-02-05    245  1_1_2010-02-05
1      1     1  2010-02-12    449  1_1_2010-02-12
2      1     1  2010-02-19    455  1_1_2010-02-19
3      1     1  2010-02-26    154  1_1_2010-02-26
4      1     1  2010-03-05     29  1_1_2010-03-05
5      1     1  2010-03-12    239  1_1_2010-03-12
6      1     1  2010-03-19    264  1_1_2010-03-19
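As a hedged variant of the same vectorized approach, pandas also provides Series.str.cat, which joins several string Series element-wise with a separator (a sketch on a two-row subset of the sample data):

```python
import io
import pandas as pd

csv = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449"""
df = pd.read_csv(io.StringIO(csv))

# str.cat joins this Series with the others in the list, element-wise,
# inserting sep between the pieces; the integer columns still need astype(str)
df['id'] = df['Store'].astype(str).str.cat(
    [df['Dept'].astype(str), df['Date']], sep='_')
```

This produces the same "1_1_2010-02-05" style ids as the + concatenation above.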

For the log, it is better to use the numpy function, which is vectorized (math.log only works on single scalar values):

In [34]: df['logSales'] = np.log(df['Sales'])

In [35]: df
Out[35]: 
   Store  Dept        Date  Sales              id  logSales
0      1     1  2010-02-05    245  1_1_2010-02-05  5.501258
1      1     1  2010-02-12    449  1_1_2010-02-12  6.107023
2      1     1  2010-02-19    455  1_1_2010-02-19  6.120297
3      1     1  2010-02-26    154  1_1_2010-02-26  5.036953
4      1     1  2010-03-05     29  1_1_2010-03-05  3.367296
5      1     1  2010-03-12    239  1_1_2010-03-12  5.476464
6      1     1  2010-03-19    264  1_1_2010-03-19  5.575949
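To make the scalar-vs-vectorized distinction concrete, here is a small sketch showing that np.log broadcasts over the whole Series while math.log raises on one:

```python
import math
import numpy as np
import pandas as pd

sales = pd.Series([245, 449, 455])

# np.log is a numpy ufunc: it applies element-wise over the Series at once
logs = np.log(sales)

# math.log expects a single scalar, so passing a multi-element Series fails
try:
    math.log(sales)
except TypeError:
    pass  # "cannot convert the series to <class 'float'>"
```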

Summarizing the comments: for a dataframe of this size, using apply will not differ much in performance from using vectorized functions (which work on the full column at once), but as your real dataframe grows larger, it will.
Apart from that, the vectorized solution above also has simpler syntax.
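The scaling claim from the comments can be checked with a sketch like the following (the sample column tiled 1000x is an assumption to get a frame where the gap becomes visible; exact timings will vary by machine):

```python
import math
import timeit

import numpy as np
import pandas as pd

# build a 7000-row frame by repeating the 7 sample Sales values
big = pd.DataFrame({'Sales': np.tile([245, 449, 455, 154, 29, 239, 264], 1000)})

# time the element-wise apply against the vectorized ufunc
t_apply = timeit.timeit(lambda: big['Sales'].apply(math.log), number=10)
t_vec = timeit.timeit(lambda: np.log(big['Sales']), number=10)
print(f"apply: {t_apply:.4f}s  vectorized: {t_vec:.4f}s")
```

On frames of this size the vectorized call is typically much faster, matching the ~2.17ms vs ~240us figures reported below.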

joris
  • I get 164us using math vs 151us using numpy log, I'm assuming that for a large dataframe numpy's one will eat Math's log for breakfast? – EdChum May 30 '14 at 09:11
  • Indeed, I get 201us (np) vs 208us (math), so almost the same for this dataframe, but for a larger one (this one 100 times repeated), numpy is clearly faster than using apply. – joris May 30 '14 at 09:12
  • For a dataframe with 7000 rows math.log takes 2.17ms versus np.log time of 240us so a significant performance improvement – EdChum May 30 '14 at 09:13
  • Also for the concatenation, for this dataframe, using apply is not slower (even a bit faster 500 vs 700 us), but for larger dataframes (7000 rows) it is again clearly slower (200 vs 80 ms). – joris May 30 '14 at 09:16
  • yes I would expect this too, good to know that the vectorised operations scale well, I still have more to learn about pandas and numpy ;) – EdChum May 30 '14 at 09:17
  • Regarding performance: I just noticed that if I use the vectorized functions, I may get a MemoryError (I have 3G RAM), but apply does not have this problem. So the vectorized functions read everything into memory, right? The original file is about 12M in size; the above is just a sample section of the file – dcc May 30 '14 at 11:34
  • BTW, I tried the same operation with R in the same computer, and R can handle the file very fast – dcc May 30 '14 at 11:38
  • If the file is only 12 Mb, you should not encounter MemoryErrors with the above code. If that is the case, there is clearly something wrong. – joris May 30 '14 at 12:22
  • @joris yeah, you are right, it is not due to insufficient RAM because I tested it on an 8G RAM machine as well. Instead, the problem seems to be caused by this statement: df['Store'].astype(str) +'_'+ df['Dept'].astype(str). It works fine with a small file with a few rows, but it crashed when there are more than 100K rows. It is that line that causes the MemoryError. – dcc May 30 '14 at 13:04
  • I am not sure if I am using something wrongly, but even if I do a simple multiplication like df['storedept'] = df['Store'].astype(int) * df['Dept'].astype(int), it causes a MemoryError. Could you tell me what is going wrong here? – dcc May 30 '14 at 13:18
  • I tried df['Store'].astype(str) with 700K rows, and it runs without any problems as expected (on a machine with 4G ram). Can you show `df.info()`, `pd.show_versions()`, your system specs and the full error message (or maybe better open a new question for it) – joris May 30 '14 at 13:27
  • Thanks, I open a new question here : http://stackoverflow.com/questions/23956390/python-pandas-memoryerror – dcc May 30 '14 at 14:00
In [153]:
import pandas as pd
import io

temp = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264"""
df = pd.read_csv(io.StringIO(temp))
df
Out[153]:
   Store  Dept        Date  Sales
0      1     1  2010-02-05    245
1      1     1  2010-02-12    449
2      1     1  2010-02-19    455
3      1     1  2010-02-26    154
4      1     1  2010-03-05     29
5      1     1  2010-03-12    239
6      1     1  2010-03-19    264

[7 rows x 4 columns]
In [154]:
# apply a lambda function row-wise; convert Store and Dept to strings to build the new string
df['id'] = df.apply(lambda x: str(x['Store']) + '_' + str(x['Dept']) + '_' + x['Date'], axis=1)
df
Out[154]:
   Store  Dept        Date  Sales              id
0      1     1  2010-02-05    245  1_1_2010-02-05
1      1     1  2010-02-12    449  1_1_2010-02-12
2      1     1  2010-02-19    455  1_1_2010-02-19
3      1     1  2010-02-26    154  1_1_2010-02-26
4      1     1  2010-03-05     29  1_1_2010-03-05
5      1     1  2010-03-12    239  1_1_2010-03-12
6      1     1  2010-03-19    264  1_1_2010-03-19

[7 rows x 5 columns]
In [155]:

import math
# now apply log to Sales to create the new column
df['logSales'] = df['Sales'].apply(math.log)
df
Out[155]:
   Store  Dept        Date  Sales              id  logSales
0      1     1  2010-02-05    245  1_1_2010-02-05  5.501258
1      1     1  2010-02-12    449  1_1_2010-02-12  6.107023
2      1     1  2010-02-19    455  1_1_2010-02-19  6.120297
3      1     1  2010-02-26    154  1_1_2010-02-26  5.036953
4      1     1  2010-03-05     29  1_1_2010-03-05  3.367296
5      1     1  2010-03-12    239  1_1_2010-03-12  5.476464
6      1     1  2010-03-19    264  1_1_2010-03-19  5.575949

[7 rows x 6 columns]

[7 rows x 6 columns]
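The row-wise apply above can also be written more compactly with an f-string, which handles the str() conversions (a sketch, assuming Python 3.6+, shown on a two-row subset of the sample data):

```python
import io

import numpy as np
import pandas as pd

csv = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-03-19,264"""
df = pd.read_csv(io.StringIO(csv))

# same row-wise idea: each row r is a Series, and the f-string formats its fields
df['id'] = df.apply(lambda r: f"{r['Store']}_{r['Dept']}_{r['Date']}", axis=1)
df['logSales'] = np.log(df['Sales'])
```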
EdChum