Cluster calculation tutorial - issue with spread

Question

While following this very interesting tutorial (https://rpubs.com/hrbrmstr/customer-segmentation-r), I came across an error I don't really understand.

Here is the piece of code that produces the message 'Error: Value column 'n' does not exist in input.' in RStudio 1.0.136:

library(readxl)
library(dplyr)
library(tidyr)
library(viridis)
library(ggplot2)
library(ggfortify)

url <- "http://blog.yhathq.com/static/misc/data/WineKMC.xlsx"
fil <- basename(url)
if (!file.exists(fil)) download.file(url, fil)

offers <- read_excel(fil, sheet = 1)
colnames(offers) <- c("offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak")
head(offers, 12)

transactions <- read_excel(fil, sheet = 2)
colnames(transactions) <- c("customer_name", "offer_id")
transactions$n <- 1
head(transactions)

left_join(offers, transactions, by="offer_id") %>% 
  count(customer_name, offer_id, wt=n) %>%
  spread(offer_id, n) %>%
  mutate_each(funs(ifelse(is.na(.), 0, .))) -> dat

The second-to-last line (the spread call) is the one raising the error.

Does anybody know why?

Generally, you should post a reproducible example here rather than using a link liable to break within a few years. Some guidance: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 Also, of course, you should identify which tools you're using beyond R itself. `spread` is not a thing in R. — Frank, Apr 17 '17 at 15:32
Sure, my bad, I amended the original post with a reproducible example — Romain, Apr 17 '17 at 15:54
Ok thanks. It's still not long-term reproducible if it needs some blog for the data. Also, it's not minimal if you're loading all these packages when presumably only dplyr is needed. The ideal is a [mcve]. Anyway, you can begin to debug this yourself by seeing if the `count` step makes a column named `n`. — Frank, Apr 17 '17 at 15:58
Without trying to regenerate data, one common problem after `*_join` commands is having same-named columns that are then renamed post-join. For instance, do either `offers` or `transactions` already contain a column named `n`? If so, then after `left_join` they are probably named `n.x` and `n.y`, which plays havoc with follow-on functions. — r2evans, Apr 17 '17 at 19:31

score 0 · Answer 1 · answered Apr 18 '17 at 01:20

Please have a look at the manual page ?dplyr::count:

Note

The column name in the returned data is usually n, even if you have supplied a weight.

If the data already has a column named n, the output column will be called nn. If the table already has columns called n and nn then the column returned will be nnn, and so on.

In this case, the original data already has a column called n, so the new column produced by count is called nn. You therefore have to change spread(offer_id, n) %>% to spread(offer_id, nn) %>%. That tutorial was probably written before this change.
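Because the automatically chosen name depends on whether the input already has an n column (and possibly on the installed dplyr version), one way to sidestep the problem is to not add the helper column at all and let count tally the rows. The following is only a minimal sketch: the toy transactions data frame is made up for illustration and stands in for the second sheet of the spreadsheet, and using the fill argument of spread in place of the mutate_each step is my substitution, not the tutorial's code.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the transactions sheet (no helper column n)
transactions <- data.frame(
  customer_name = c("Smith", "Smith", "Jones"),
  offer_id = c(1, 2, 2)
)

# Count rows per customer/offer pair; with no pre-existing n column
# the result column is named n
counts <- transactions %>%
  count(customer_name, offer_id)

# Check which column names count() actually produced before spreading
print(names(counts))

# One column per offer_id; missing combinations are filled with 0
dat <- counts %>%
  spread(offer_id, n, fill = 0)

print(dat)
```

Printing names(counts) before the spread step is also a quick way to see whether the value column in your own session is called n or nn.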
