1

I have the following the following data frame (‘mydf’):

ID   Type   Count  Mark
U1     A     50     R1
U1     A     50     R3
U1     A     50     R4
U1     A     50     R3
U2     B     24     R2
U2     B     24     R5
U3     A     88     R1

Each row contains 4 variables:

  • ID: the user’s ID.

  • Type: the user’s type, which is categorical and consistent for all rows with that ID.

  • Count: the user’s count, which is continuous and consistent for all rows with that ID.

  • Mark: one of 15 categorical symbols, which may vary between different rows with the same ID.

Presently, there is one row for each mark which appears, so that there are multiple rows per ID.

The number of marks is different for different IDs (anything from 4-50).

What I a need to do:

  1. Consolidate rows with the same ID into a single row, which contains all the consistent information (Type/Count), as well as a listing (or count) of all the marks which appear for that ID. This listing needs to reflect how often each mark appears for that ID (for example, if the same mark appears twice for a single ID, this needs to be reflected, as happens for U1 in rows 2 and 4).

  2. Get a count of the number of time each mark appears for each ID.

  3. Get a count of the total number of marks for each ID.

Listing of the marks for each ID is not necessary as long as the counts described in 2 and 3 exist (see for example the expected output below).

Expected output (approximately):

ID   Type   Count  Mark_R1 Mark_R2 Mark_R3 Mark_R4 Mark_R5 Mark_Total
U1     A     50       1       0       2       1       0       3
U2     B     24       0       1       0       0       1       2
U3     A     88       1       0       0       0       0       1

Thank you for your help!

Thredolsen
  • 247
  • 1
  • 11

1 Answers1

0

We can use dplyr

library(dplyr)
library(tidyr)
mydf %>%
     count(ID, Type, Count, Mark) %>%
     spread(Mark, n, fill = 0) %>%
     ungroup() %>% 
     mutate(Total = rowSums(.[grep("R\\d+", names(.))]))
#     ID  Type Count    R1    R2    R3    R4    R5 Total
#   <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1    U1     A    50     1     0     2     1     0     4
#2    U2     B    24     0     1     0     0     1     2
#3    U3     A    88     1     0     0     0     0     1
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Also using dplyr: df %>% group_by(ID,Type, Count, Mark) %>% summarize(a = n()) %>% spread(Mark,a,fill=0) – PhilC Aug 01 '16 at 19:34
  • I have tested this solution and it does work, so I marked it as the selected answer. However, I recommend using the solution suggested by @DavidArenburg, which requires less code and produces a cleaner result: library(reshape2) ; dcast(mydf, ID + Type + Count ~ Mark, margins = "Mark") – Thredolsen Aug 02 '16 at 22:31
  • @Thredolsen You are right, the `dcast` approach is more concise. – akrun Aug 03 '16 at 02:52