4

I am trying to extract the colnames of a data frame, based on the values in the cells. My data is a series of a couple hundred categories, with a simple binary 0 or 1 in the cells to indicate which column name I want in my new df.

To illustrate my point:

year cat1 cat2 cat3 ... catN
2000  0    0    1        0
2001  1    0    0        0
2002  0    0    0        1
....
2018  0    1    0        0

I am trying to get a df like:

year category 
2000  cat3
2001  cat1  
2002  catN  
....
2018  cat2  

My code:

newdf <- as.data.frame(colnames(mydf)[which(mydf == "1", arr.ind = TRUE)[2]])

But alas this only returns one category name!

Any help would be greatly appreciated!

Ian Kemp
  • 28,293
  • 19
  • 112
  • 138
Dan
  • 113
  • 1
  • 9
  • have a closer look at packages `dplyr` and `tidyr` or more complicated `data.table`. They are designed to do exactly that. – Andre Elrico Jul 31 '18 at 12:49

6 Answers6

4

A base R solution:

Using sapply to find which are the ones and get the names.

out <- data.frame(year = df1$year, category = names(sapply(df1[, -1], function(x) which(x == 1))))

out
 year category
1 2000     cat1
2 2001     cat2
3 2002     cat3
4 2018     catN

data:

df1 <- structure(list(year = c(2000L, 2001L, 2002L, 2018L), cat1 = c(0L, 
1L, 0L, 0L), cat2 = c(0L, 0L, 0L, 1L), cat3 = c(1L, 0L, 0L, 0L
), catN = c(0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, 
-4L))
phiver
  • 23,048
  • 14
  • 44
  • 56
  • ninja. worked a charm. – Dan Jul 31 '18 at 13:24
  • so applying the above to my own data, i applied used as.data.frame to coerce my data into a data frame, but above code with this change is returning: argument "x" is missing, with no default... am i being thick about something? – Dan Aug 01 '18 at 10:50
  • I've copied the code as is, replacing df1 with my data frame name. what could be missing in the data argument? (to clarify the data argument in sapply(df1[,-1] etc. is df1? – Dan Aug 01 '18 at 11:44
  • it outputs all the column names or "categories" – Dan Aug 01 '18 at 11:57
  • the full sapply returns the categories with their corresponding 0s or 1s. the names(sapply...) returns something that looks like this: [1] "cat1" "cat2" [3] "cat3" "cat4" – Dan Aug 01 '18 at 12:25
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/177191/discussion-between-phiver-and-dan). – phiver Aug 01 '18 at 12:40
3

A possible solution is this:

library(tidyverse)

df = data.frame(year = 2000:2002,
                cat1 = c(0,0,1),
                cat2 = c(1,0,0),
                cat3 = c(0,1,0))

df %>%
  gather(category, value, -year) %>%  # reshape data
  filter(value == 1) %>%              # keep rows with 1s
  select(-value) %>%                  # remove that column
  arrange(year)                       # order that column (if needed)

#   year category
# 1 2000     cat2
# 2 2001     cat3
# 3 2002     cat1
AntoniosK
  • 15,991
  • 2
  • 19
  • 32
3

Another option using max.col from base R

data.frame(year = dat$year, category = names(dat[-1])[max.col(dat[-1])])
#  year category
#1 2000     cat3
#2 2001     cat1
#3 2002     catN
#4 2018     cat2

max.col finds the positions of the 1s in your data (first column excluded) and returns a numeric vector of column positions that we use to subset the column names, again first column excluded.

Data as supplied by Rui Barradas.

markus
  • 25,843
  • 5
  • 39
  • 58
1

A simple solution:

# your data
data <-data.frame(year =c(2016,2017,2018),
                  cat.1=c(0,1,0),
                  cat.2=c(0,0,1),
                  cat.3 =c(1,0,0))

# a nice library 
library(reshape2)

# from wide to long
filtered <- melt(data, id = 'year') %>% filter(value>0) 

# remove the useless column
filtered <- filtered[,1:2]
    > filtered
  year variable
1 2017    cat.1
2 2018    cat.2
3 2016    cat.3
s__
  • 9,270
  • 3
  • 27
  • 45
1

This is basicaly a Reshaping data.frame from wide to long format trick. To do it I will use package reshape2.

Then select the rows with value == 1.

result <- reshape2::melt(dat, id.vars = "year")
result <- result[result$value == 1, 1:2]
result <- result[order(result[[1]]), ]
names(result)[2] <- "category"
row.names(result) <- NULL

result
#  year category
#1 2000     cat3
#2 2001     cat1
#3 2002     catN
#4 2018     cat2

DATA.

dat <- read.table(text = "
year cat1 cat2 cat3  catN
2000  0    0    1   0
2001  1    0    0   0
2002  0    0    0   1
2018  0    1    0   0
", header = TRUE)
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
1

Yet another base R option using stack

subset(data.frame(year = df$year, stack(df, select = -year)), values == 1)[-2]
#   year  ind
#2  2001 cat1
#8  2018 cat2
#9  2000 cat3
#15 2002 catN
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68