2

I want to convert a data frame (df) into a different data frame (df1), where the rows of df1 correspond to the frequency of occurrence of column B (df). That is, in data frame df1, the values in the rows correspond to the number of times they appear in column B of df for a particular variable of column 'A' of df. Here is an example:

df


      A        B
  1. 21      2011
  2. 21      2012
  3. 21      2011
  4. 22      2013
  5. 22      2011
  6. 23      2012
  7. 23      2011
  8. 23      2012
  9. 23      2014

df1

           2011   2012    2013    2014
  1. 21     2       1       0       0
  2. 22     1       0       1       0
  3. 23     1       2       0       1
gung - Reinstate Monica
  • 11,583
  • 7
  • 60
  • 79
Ryan Zfir
  • 23
  • 5

3 Answers3

4

Using the table function in base-r:

table(df)
#      B
# A    2011 2012 2013 2014
# 21    2    1    0    0
# 22    1    0    1    0
# 23    1    2    0    1
Deena
  • 5,925
  • 6
  • 34
  • 40
  • I want to convert 'df' into 'df1 ( another data frame)' and then after in Vectors so that I can use "dist" function for calculating distances and also my original data has 1lakh of rows so while converting the above table into data frame is not working. error: cannot allocate of vector of size 2.6GB – Ryan Zfir Jul 22 '17 at 13:04
  • @RyanZfir How do you convert to data.frame? Try `as.data.frame.matrix(table(df))` – Gregory Demin Jul 22 '17 at 13:45
  • yes, I used the same command. Actually, my main objective to change the above table into Vector so that I can use "dist" function to calculate distances between each vectors using euclidian method. – Ryan Zfir Jul 22 '17 at 16:55
  • @RyanZfir Did this error occur after the `table` or after the `as.data.frame` or after the `dist`? Generally, you can calculate distance without conversion to data.frame: `dist(table(df))` – Gregory Demin Jul 22 '17 at 19:43
1
df = read.table(text="          A        B
      1. 21      2011
                2. 21      2012
                3. 21      2011
                4. 22      2013
                5. 22      2011
                6. 23      2012
                7. 23      2011
                8. 23      2012
                9. 23      2014",header=T)

library(dplyr)
df = df %>% group_by(A,B) %>% mutate(n=n()) %>% ungroup %>% complete(A, B, fill = list(n = 0)) %>% as.data.frame 
df<- reshape(df,timevar="B",idvar="A",direction="wide")
colnames(df)<- gsub("n\\.","",colnames(df))

Output

   A   2011   2012   2013   2014
1 21      2      1      0      0
4 22      1      0      1      0
6 23      1      2      0      1
Florian
  • 24,425
  • 4
  • 49
  • 80
1

Using dplyr and tidyr:

> library(tidyr)

Input:

> df <- read.csv(text="A,B
21,2011
21,2012
21,2011
22,2013
22,2011
23,2012
23,2011
23,2012
23,2014", header=T)    

Group, summarise each group (tally is shorthand for summarise(N = n())) and spread into columns (spread):

> df2 <- df %>% group_by(A,B) %>% tally %>% spread(B, n)
> df2[is.na(df2)] <- 0

Output:

> df2
# A tibble: 3 x 5
# Groups:   A [3]
      A `2011` `2012` `2013` `2014`
* <int>  <int>  <dbl>  <dbl>  <dbl>
1    21      2      1      0      0
2    22      1      0      1      0
3    23      1      2      0      1

Or, by using the built-in table function and spreading using tidyr afterwards (thanks to user C.Square):

> table(df) %>% data.frame %>% spread(B, Freq)
   A 2011 2012 2013 2014
1 21    2    1    0    0
2 22    1    0    1    0
3 23    1    2    0    1
shosaco
  • 5,915
  • 1
  • 30
  • 48