I have a dataset that records price changes, among other variables. I would like to mutate the price column into a categorical variable. As far as I can tell, the relevant tools in R are the dplyr package and/or the cut() function.

> head(btc_data)
                 time  btc_price
1 2017-08-27 22:50:00 4,389.6113
2 2017-08-27 22:51:00 4,389.0850
3 2017-08-27 22:52:00 4,388.8625
4 2017-08-27 22:53:00 4,389.7888
5 2017-08-27 22:56:00 4,389.9138
6 2017-08-27 22:57:00 4,390.1663
   

> dput(btc_data)
        ("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763", 
        "4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325", 
        "4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025", 
        "4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075", 
        "4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738", 
        "4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788", 
        "4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038", 
        "4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288", 
        "5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788", 
        "5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175", 
        "5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350", 
        "5,013.9075"), class = "factor")), .Names = c("time", "btc_price"
    ), class = "data.frame", row.names = c(NA, -10023L))

The difficulty is in the categories I want to create. The categories -1, 0, and 1 should be based on the % change over the preceding time window.

So, for example, a 20% increase in price over the past 60 minutes would be labeled 1, a 20% decrease in price over the past 60 minutes would be labeled -1, and anything else would be 0.

Is this possible in R? What is the most efficient way to implement the change?

There is a similar question here and also here, but these do not answer my question for two reasons:

a) I am trying to calculate the % change, not simply the difference between two rows.

b) The calculation should be based on the max/min values over a rolling window (i.e. a 20% decrease in the past hour = -1, a 20% increase in the past hour = 1).
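Roughly, the rule I have in mind could be sketched in base R like this (toy numbers, and the window shortened to 3 just for illustration; the real window would be the last 60 one-minute rows):

```r
# toy minute-by-minute price series
price <- c(100, 101, 99, 130, 75)

# classify each observation against the min/max of a trailing window
# (window of 3 for the toy example; note the window includes the current row)
window <- 3
category <- rep(0L, length(price))
for (t in seq_along(price)) {
  past <- price[max(1, t - window):t]
  if (price[t] / min(past) >= 1.2) category[t] <- 1L   # >= 20% rise
  if (price[t] / max(past) <= 0.8) category[t] <- -1L  # >= 20% drop
}
category
# c(0, 0, 0, 1, -1)
```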

zsad512
  • dput() your data please: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Lukasz Sep 06 '17 at 22:04
  • @Lukasz I added the `dput()`. – zsad512 Sep 06 '17 at 22:32
  • Possible duplicate of [Calculate difference between values in consecutive rows by group](https://stackoverflow.com/questions/14846547/calculate-difference-between-values-in-consecutive-rows-by-group) – lebelinoz Sep 06 '17 at 22:33
  • It could also be a duplicate of https://stackoverflow.com/questions/36110977/r-how-to-calculate-percentage-of-change-between-two-rows – lebelinoz Sep 06 '17 at 22:35
  • 3
    I think the `dput` output is incomplete; I don't see `structure`, `list` or any times. – neilfws Sep 06 '17 at 22:38
  • @neilfws the reason could be that the max output was exceeded, but i literally just copied and pasted the output of `dput(btc_data)` – zsad512 Sep 07 '17 at 00:12
  • Try again with `dput(head(btc_data))`. If more than 6 rows are required for a good example dataset, try _e.g._ `dput(head(btc_data, 20))` to get 20 rows. – neilfws Sep 07 '17 at 01:16

3 Answers


It's always difficult to work with percentages. Be aware that everything here is flexible: whatever you choose as a reference (a difference, a running mean, a max, or something else), there are at least two parameters on the reference side that you have to pick carefully, and the same applies to the value you set in relation to that reference. Together this gives you an almost infinite number of ways to calculate your percentage. That is the key to your question.

# create the data

dat <- c("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763", 
         "4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325", 
         "4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025", 
         "4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075", 
         "4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738", 
         "4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788", 
         "4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038", 
         "4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288", 
         "5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788", 
         "5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175", 
         "5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350", 
         "5,013.9075")
dat <- as.numeric(gsub(",","",dat))

# calculate the difference to the last minute
dd <- diff(dat)

# running ratio: each minute's difference relative to the mean
# difference over the preceding minutes
interval = 20
out <- NULL
for(z in interval:length(dd)){
  out <- c(out, (dd[z] / mean(dd[(z-interval):z])))
}

# running ratio: each minute's price relative to the mean price
# over the preceding minutes
out2 <- NULL
for(z in interval:length(dd)){
  out2 <- c(out2, (dat[z] / mean(dat[(z-interval):z])))
}

# build categories from the difference ratio
catego <- as.vector(cut(out, breaks=c(-Inf,0.8,1.2,Inf), labels=c(-1,0,1)))
catego <- c(rep(NA, interval), as.numeric(catego))  # pad with NAs so it lines up with dat


# plot
plot(dat, type="b", main="price original")
plot(dd, main="absolute difference to last minute", type="b")
plot(out, main=paste('difference to last minute, relative to "mean" of the last', interval, 'min'), type="b")
abline(h=c(0.8, 1.2), col="magenta")
plot(catego, main=paste("categories for", interval))
plot(out2, main=paste('price last minute, relative to "mean" of the last', interval, 'min'), type="b")

I think what you are looking for is the last plot (price last minute, relative to the "mean" of the last minutes). The values in this example vary between 1.0010 and 1.0025, far away from your thresholds of 0.8 and 1.2. You can make the differences bigger by choosing a longer time interval than 20 min; maybe a week (11340 min) would be good, but even with such a long interval it will be difficult to get a value above 1.2. The problem is the high price of about 5000: a change of 10 is very little.

You also have to take into account that your sample shows a continuously rising price, so it is impossible to get a value under 1.

In this calculation I used mean() as the running reference over the last minutes. I'm not sure, but I suspect that on stock markets both min() and max() are used as references over different time intervals: min() as the reference when the price is rising, and max() when the price is falling. All of this is possible in R.
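A minimal sketch of that min()/max() idea, assuming the same kind of numeric price vector as `dat` above and a 20-observation window (the toy data is constructed so that all three categories actually occur):

```r
# toy prices: flat, then a ~26% spike, then a sharp drop
prices   <- c(rep(5000, 25), rep(6300, 5), rep(4800, 5))
interval <- 20

# classify each price against the min/max of the previous `interval` observations
catego_mm <- rep(NA_integer_, length(prices))
for (z in (interval + 1):length(prices)) {
  win <- prices[(z - interval):(z - 1)]
  catego_mm[z] <- if (prices[z] / min(win) >= 1.2) 1L        # >= 20% above rolling min
                  else if (prices[z] / max(win) <= 0.8) -1L  # >= 20% below rolling max
                  else 0L
}
```

The first `interval` entries stay NA because there is no full window yet; here `catego_mm[21:35]` comes out as five 0s, five 1s, and five -1s.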

and-bri
  • I would like the parameters to be a 20% change in one hour – zsad512 Sep 07 '17 at 00:16
  • @zsad512 you mean the difference between max and min during one hour is 100%? – and-bri Sep 07 '17 at 00:18
  • This should be a rolling calculation, so in the last 60 minutes: (max value-current value)/current value = 1.2 => then mark as 1 – zsad512 Sep 07 '17 at 00:25
  • you are sure its not this: `current_value / max_value{last 60 min} >= 1.2` ? – and-bri Sep 07 '17 at 00:40
  • Actually, I think youre right- forgive me... `current_value / max_value{last 60 min} >= 1.2` = `1` but I also need to calculate `current_value / max_value{last 60 min} <= .8` = -1 and also, if not 1 or -1 then 0 – zsad512 Sep 07 '17 at 00:46
  • i made what you ask but...I am not sure if it is very useful. you understand what i do in the script? you have a idea about R? – and-bri Sep 07 '17 at 00:55
  • `> btc_data$change <- cut(out, breaks=c(-Inf,0.8,1.2,Inf), labels=c(-1,0,1)) Error in `$<-.data.frame`(`*tmp*`, change, value = c(1L, 2L, 1L, 1L, 1L, : replacement has 9962 rows, data has 10023` – zsad512 Sep 07 '17 at 01:02
  • To test can you try using value for the past 30 minutes instead of 60? – zsad512 Sep 07 '17 at 01:12
  • I was just about to add that this doesnt work because there are no `1`s – zsad512 Sep 07 '17 at 01:24
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/153822/discussion-between-and-bri-and-zsad512). – and-bri Sep 07 '17 at 01:25

Here's an easy way to do this without having to rely on the data.table package. If you want this for only 60 minute intervals, you would first need to filter btc_data for the relevant 60 minute intervals.

# make sure time is a date that can be sorted properly
btc_data$time = as.POSIXct(btc_data$time)

# sort data frame
btc_data = btc_data[order(btc_data$time),]

# calculate percentage change for 1 minute lag
n = nrow(btc_data)
btc_data$perc_change = NA
btc_data$perc_change[2:n] = (btc_data$btc_price[2:n] - btc_data$btc_price[1:(n-1)]) /
  btc_data$btc_price[1:(n-1)]

# create category column
# NOTE: first category entry will be NA
btc_data$category = ifelse(btc_data$perc_change > 0.20, 1, ifelse(btc_data$perc_change < -0.20, -1, 0)) 

Using the data.table package and converting btc_data to a data.table would be a much more efficient and faster way to do this. There is a learning curve to using the package, but there are great vignettes and tutorials for this package.
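To give an idea of the data.table route for the same one-minute-lag calculation: shift() is data.table's lag helper and fifelse() its vectorised ifelse (the toy frame below just stands in for btc_data):

```r
library(data.table)

# toy frame standing in for btc_data (same two columns as in the question)
btc_dt <- data.table(
  time      = as.POSIXct("2017-08-27 22:50:00", tz = "UTC") + 60 * 0:3,
  btc_price = c(100, 125, 124, 90)
)
setorder(btc_dt, time)

# percentage change vs. the previous row; row 1 stays NA
btc_dt[, perc_change := (btc_price - shift(btc_price)) / shift(btc_price)]
btc_dt[, category := fifelse(perc_change > 0.2, 1L,
                      fifelse(perc_change < -0.2, -1L, 0L))]
```

With these toy prices the category column comes out NA, 1, 0, -1.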

Kelli-Jean

I can't completely reproduce your example, but if I had to guess, you would want to do something like this:

# convert the factor prices to a plain numeric vector
# (the original converted to a data.frame column, which breaks indexing later)
btc_data$btc_price <- as.numeric(gsub(",", "", as.character(btc_data$btc_price)))

# percentage change relative to the price 60 rows earlier
pct_change <- rep(NA_real_, nrow(btc_data))
for (i in 61:nrow(btc_data)) {
  pct_change[i] <- (btc_data$btc_price[i] - btc_data$btc_price[i - 60]) /
    btc_data$btc_price[i - 60]
}

pct_change <- pct_change[61:length(pct_change)]

# -Inf/Inf breaks ensure the extreme values are not dropped by cut()
new_category <- cut(pct_change, breaks = c(-Inf, -0.2, 0.2, Inf),
                    labels = c(-1, 0, 1))

btc_data.new <- data.frame(btc_data[61:nrow(btc_data), ], new_category)
jalind