Adding calculated column(s) to a dataframe in pandas

Question

I have an OHLC price data set, that I have parsed from CSV into a Pandas dataframe and resampled to 15 min bars:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 500047 entries, 1998-05-04 04:45:00 to 2012-08-07 00:15:00
Freq: 15T
Data columns:
Close    363152  non-null values
High     363152  non-null values
Low      363152  non-null values
Open     363152  non-null values
dtypes: float64(4)

I would like to add various calculated columns, starting with simple ones such as period Range (H-L) and then booleans to indicate the occurrence of price patterns that I will define - e.g. a hammer candle pattern, for which a sample definition:

def closed_in_top_half_of_range(h,l,c):
    return c > l + (h-l)/2

def lower_wick(o,l,c):
    return min(o,c)-l

def real_body(o,c):
    return abs(c-o)

def lower_wick_at_least_twice_real_body(o,l,c):
    return lower_wick(o,l,c) >= 2 * real_body(o,c)

def is_hammer(row):
    return lower_wick_at_least_twice_real_body(row["Open"],row["Low"],row["Close"]) \
    and closed_in_top_half_of_range(row["High"],row["Low"],row["Close"])

Basic problem: how do I map the function to the column, specifically where I would like to reference more than one other column or the whole row or whatever?

This post deals with adding two calculated columns off of a single source column, which is close, but not quite it.

And slightly more advanced: for price patterns that are determined with reference to more than a single bar (T), how can I reference different rows (e.g. T-1, T-2 etc.) from within the function definition?

BrenBarn · Accepted Answer · 2012-09-11T20:56:40.427

The exact code will vary for each of the columns you want to do, but it's likely you'll want to use the map and apply functions. In some cases you can just compute using the existing columns directly, since the columns are Pandas Series objects, which also work as Numpy arrays, which automatically work element-wise for usual mathematical operations.

>>> d
    A   B  C
0  11  13  5
1   6   7  4
2   8   3  6
3   4   8  7
4   0   1  7
>>> (d.A + d.B) / d.C
0    4.800000
1    3.250000
2    1.833333
3    1.714286
4    0.142857
>>> d.A > d.C
0     True
1     True
2     True
3    False
4    False

If you need to use operations like max and min within a row, you can use apply with axis=1 to apply any function you like to each row. Here's an example that computes min(A, B)-C, which seems to be like your "lower wick":

>>> d.apply(lambda row: min([row['A'], row['B']])-row['C'], axis=1)
0    6
1    2
2   -3
3   -3
4   -7

Hopefully that gives you some idea of how to proceed.

Edit: to compare rows against neighboring rows, the simplest approach is to slice the columns you want to compare, leaving off the beginning/end, and then compare the resulting slices. For instance, this will tell you for which rows the element in column A is less than the next row's element in column C:

d['A'][:-1] < d['C'][1:]

and this does it the other way, telling you which rows have A less than the preceding row's C:

d['A'][1:] < d['C'][:-1]

Doing ['A"][:-1] slices off the last element of column A, and doing ['C'][1:] slices off the first element of column C, so when you line these two up and compare them, you're comparing each element in A with the C from the following row.

score 48 · Answer 2 · answered Sep 11 '12 at 20:04

48

You could have is_hammer in terms of row["Open"] etc. as follows

def is_hammer(rOpen,rLow,rClose,rHigh):
    return lower_wick_at_least_twice_real_body(rOpen,rLow,rClose) \
       and closed_in_top_half_of_range(rHigh,rLow,rClose)

Then you can use map:

df["isHammer"] = map(is_hammer, df["Open"], df["Low"], df["Close"], df["High"])

answered Sep 11 '12 at 20:04

Andy Hayden

359,921
101
625
535

2

Also useful, many thanks. Many ways to skin a cat and all. I would give you an upvote, but this is my first ever question on StackOverflow and I sadly don't have enough rep. I don't suppose you have nny ideas on the second part, viz referencing neighbouring rows in the dataframe from within the map/apply function? Cheers again. – ultra909 Sep 11 '12 at 20:43
1

For some reason, the method using `map(f, col1, col2)` is way faster than `df.apply(..., axis=1)`. map takes 0.35s vs df.apply taking 26s for a 1M row dataframe. Any idea why? (python 2.7 and pandas 0.18.1) – MohamedEzz Feb 16 '18 at 14:33
Sometimes this works great, but sometimes I get the warning: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"... Any insight? – elomage Aug 30 '18 at 12:43
1

@elomage it's unclear without more info. Perhaps early on you're slicing our some of the dataframe then assigning within this view. You can make the view a copy using `df = df.copy()` above... but this may be better answered as a new question. – Andy Hayden Aug 30 '18 at 16:26
@AndyHayden df.copy() did the trick, thanks. I did some slicing before. For smaller dataframes there was no problem, for larger I had to do the copy(). – elomage Aug 31 '18 at 12:26
2

I'm quite a new to Python, but this map function should be wrapped by list function I guess, or else it will return containter object instead of list of values – Павел May 19 '19 at 05:53

score 5 · Answer 3 · answered Aug 16 '14 at 12:52

5

For the second part of your question, you can also use shift, for example:

df['t-1'] = df['t'].shift(1)

t-1 would then contain the values from t one row above.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html

answered Aug 16 '14 at 12:52

fantabolous

21,470
7
54
51

JoeCondron · Answer 4 · 2015-03-06T17:34:27.420

The first four functions you list will work on vectors as well, with the exception that lower_wick needs to be adapted. Something like this,

def lower_wick_vec(o, l, c):
    min_oc = numpy.where(o > c, c, o)
    return min_oc - l

where o, l and c are vectors. You could do it this way instead which just takes the df as input and avoid using numpy, although it will be much slower:

def lower_wick_df(df):
    min_oc = df[['Open', 'Close']].min(axis=1)
    return min_oc - l

The other three will work on columns or vectors just as they are. Then you can finish off with

def is_hammer(df):
    lw = lower_wick_at_least_twice_real_body(df["Open"], df["Low"], df["Close"]) 
    cl = closed_in_top_half_of_range(df["High"], df["Low"], df["Close"])
    return cl & lw

Bit operators can perform set logic on boolean vectors, & for and, | for or etc. This is enough to completely vectorize the sample calculations you gave and should be relatively fast. You could probably speed up even more by temporarily working with the numpy arrays underlying the data while performing these calculations.

For the second part, I would recommend introducing a column indicating the pattern for each row and writing a family of functions which deal with each pattern. Then groupby the pattern and apply the appropriate function to each group.

Adding calculated column(s) to a dataframe in pandas

4 Answers4

Linked