I have a dataframe (dtetags.df) with a date column that has many duplicate dates:

dtetags.df$Date
 "2016-07-22" "2016-07-22" "2016-07-21" "2016-07-21" "2016-07-20" "2016-07-20" "2016-07-19" "2016-07-19" "2016-07-18" "2016-07-18" "2016-07-15" "2016-07-15" "2016-07-15" "2016-07-14"
 "2016-07-14" "2016-07-13" "2016-07-13" "2016-07-13" "2016-07-12" "2016-07-12" "2016-07-12" "2016-07-12" "2016-07-11" "2016-07-11" "2016-07-11" "2016-07-11" "2016-07-08" "2016-07-08"
 "2016-07-08" "2016-07-07" "2016-07-07" "2016-07-07" "2016-07-07" "2016-07-06" "2016-07-06" "2016-07-05" "2016-07-05" "2016-07-05" "2016-07-05" "2016-07-01" "2016-07-01" "2016-06-30"
 "2016-06-30" "2016-06-29" "2016-06-29" "2016-06-29" "2016-06-29" "2016-06-29" "2016-06-28" "2016-06-28" "2016-06-28" "2016-06-27" "2016-06-27" "2016-06-27" "2016-06-24" "2016-06-24"
 "2016-06-23" "2016-06-23" "2016-06-22" "2016-06-22" "2016-06-21" "2016-06-21" "2016-06-20" "2016-06-20" "2016-06-17" "2016-06-17" "2016-06-16" "2016-06-16" "2016-06-15" "2016-06-15"
 "2016-06-14" "2016-06-13" "2016-06-13" "2016-06-10" "2016-06-10" "2016-06-09" "2016-06-09" "2016-06-09" "2016-06-09" "2016-06-08" "2016-06-08" "2016-06-07" "2016-06-07" "2016-06-06"
 "2016-06-06" "2016-06-06" "2016-06-01" "2016-06-01" "2016-05-29" "2016-05-29" "2016-05-27" "2016-05-27" "2016-05-26" "2016-05-26" "2016-05-25" "2016-05-25" "2016-05-24" "2016-05-23"
 "2016-05-23" "2016-05-20"

and a number of binary tag columns that show whether a post was made with that tag on that date, for example:

dtetags.df$Technology
 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0" "0" "0" "1" "1" "0" "1" "0" "1"
 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"

I am trying to use ddply(dtetags.df, "Date", numcolwise(sum)), based on this question, but instead of a summary it returns <0 rows> (or 0-length row.names). I have tried several variations of the ddply call, but none of them work.

The ideal output would look like:

               Date            Technology
1        2016-07-22                     0
2        2016-07-21                     0
3        2016-07-20                     0
4        2016-07-19                     0
5        2016-07-18                     0
6        2016-07-15                     0
7        2016-07-14                     0
8        2016-07-13                     0
9        2016-07-12                     0
10       2016-07-11                     0
11       2016-07-08                     0
12       2016-07-07                     0
13       2016-07-06                     1
14       2016-07-05                     0
15       2016-07-01                     2
16       2016-06-30                     1
17       2016-06-29                     1
18       2016-06-28                     0
19       2016-06-27                     0
20       2016-06-24                     1
21       2016-06-23                     0
22       2016-06-22                     0
23       2016-06-21                     0
24       2016-06-20                     0
25       2016-06-17                     0
26       2016-06-16                     0
27       2016-06-15                     0
28       2016-06-14                     1
29       2016-06-13                     0
30       2016-06-10                     0
31       2016-06-09                     0
32       2016-06-08                     0
33       2016-06-07                     0
34       2016-06-06                     0
35       2016-06-01                     0
36       2016-05-29                     0
37       2016-05-27                     0
38       2016-05-26                     0
39       2016-05-25                     0
40       2016-05-24                     0
41       2016-05-23                     0
42       2016-05-20                     0

Is there something obvious I am doing wrong?

Conversion from Factor to Numeric

I removed the Date column, applied data.frame(apply(dtetags.df, 2, function(x) as.numeric(as.character(x)))) to the rest of the data frame, and prepended the Date column back in.

dput(dtetags.df)
structure(list(Date = c("2016-07-22", "2016-07-22", "2016-07-21", 
"2016-07-21", "2016-07-20", "2016-07-20", "2016-07-19", "2016-07-19", 
"2016-07-18", "2016-07-18", "2016-07-15", "2016-07-15", "2016-07-15", 
"2016-07-14", "2016-07-14", "2016-07-13", "2016-07-13", "2016-07-13", 
"2016-07-12", "2016-07-12", "2016-07-12", "2016-07-12", "2016-07-11", 
"2016-07-11", "2016-07-11", "2016-07-11", "2016-07-08", "2016-07-08", 
"2016-07-08", "2016-07-07", "2016-07-07", "2016-07-07", "2016-07-07", 
"2016-07-06", "2016-07-06", "2016-07-05", "2016-07-05", "2016-07-05", 
"2016-07-05", "2016-07-01", "2016-07-01", "2016-06-30", "2016-06-30", 
"2016-06-29", "2016-06-29", "2016-06-29", "2016-06-29", "2016-06-29", 
"2016-06-28", "2016-06-28", "2016-06-28", "2016-06-27", "2016-06-27", 
"2016-06-27", "2016-06-24", "2016-06-24", "2016-06-23", "2016-06-23", 
"2016-06-22", "2016-06-22", "2016-06-21", "2016-06-21", "2016-06-20", 
"2016-06-20", "2016-06-17", "2016-06-17", "2016-06-16", "2016-06-16", 
"2016-06-15", "2016-06-15", "2016-06-14", "2016-06-13", "2016-06-13", 
"2016-06-10", "2016-06-10", "2016-06-09", "2016-06-09", "2016-06-09", 
"2016-06-09", "2016-06-08", "2016-06-08", "2016-06-07", "2016-06-07", 
"2016-06-06", "2016-06-06", "2016-06-06", "2016-06-01", "2016-06-01", 
"2016-05-29", "2016-05-29", "2016-05-27", "2016-05-27", "2016-05-26", 
"2016-05-26", "2016-05-25", "2016-05-25", "2016-05-24", "2016-05-23", 
"2016-05-23", "2016-05-20"), `Technology` = c(0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Date", 
"Technology"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -100L))
arebearit
  • Please show a reproducible small example using `dput` and your expected output – akrun Aug 01 '16 at 17:27
  • Your input and expected output seems to be having different values. Perhaps `library(dplyr);dtetags.df %>% group_by(Date) %>% mutate(new = row_number() * as.numeric(as.character(Technology)))` – akrun Aug 01 '16 at 18:17
  • Is there a generalizable solution? That's what I was trying to do by not specifying the column. Also, I am slightly confused about what you mean about differing inputs/outputs. Thanks! – arebearit Aug 01 '16 at 18:34
  • @arebearit: if you mean apply the summarise to all columns, then you can use `dplyr` with `summarise_each`, but we are still trying to determine what you want exactly. A consistent input and output example would help. – aichao Aug 01 '16 at 18:41
  • I've corrected an inconsistency in the output, but the general theme is that this ddply function will take every instance of date duplication and sum across those rows to give a sort of composite value. Is that what you mean by inconsistent input/output? – arebearit Aug 01 '16 at 18:57
  • Oh, I get it now. I think the unusual factor structure is left over from when I converted these from strings (see `Label = c("0", "1")`). Is there a way to map the `as.numeric` command over each element of a data frame without losing its data frame structure? I tried using the method in @akrun's answer here: [link](http://stackoverflow.com/questions/27528907/how-to-convert-data-frame-column-from-factor-to-numeric) but it dislocates the data from the column titles. – arebearit Aug 01 '16 at 19:25

1 Answer

To accomplish what you want, you can use the dplyr package:

library(dplyr)
out <- dtetags.df %>% group_by(Date) %>% summarise_each(funs(sum)) %>% arrange(desc(Date))

Notes:

  1. group_by(Date) groups the rows by date, so each subsequent operation runs separately over the rows that share a date.
  2. summarise_each(funs(sum)) applies sum to every column other than Date, giving one row per date.
  3. arrange(desc(Date)) sorts the results in descending order by date.
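In more recent dplyr releases (1.0.0 and later), summarise_each and funs are deprecated in favour of across. A minimal sketch of the same three steps in that style is below; the `toy` data frame and its values are a made-up stand-in for dtetags.df, not the question's data.

```r
library(dplyr)

# Hypothetical stand-in for dtetags.df: a character Date column
# plus one numeric 0/1 tag column.
toy <- data.frame(
  Date = c("2016-07-01", "2016-07-01", "2016-06-30", "2016-06-30"),
  Technology = c(1, 1, 0, 1),
  stringsAsFactors = FALSE
)

out <- toy %>%
  group_by(Date) %>%                           # one group per date
  summarise(across(where(is.numeric), sum)) %>% # sum every numeric column
  arrange(desc(Date))                          # newest date first

print(out)
```

Restricting the sum to where(is.numeric) means a stray non-numeric column is dropped from the result rather than raising an error, which may or may not be what you want.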

Given the input data, the output is as expected:

print(out)
# A tibble: 42 x 2
         Date Technology
        <chr>      <dbl>
1  2016-07-22          0
2  2016-07-21          0
3  2016-07-20          0
4  2016-07-19          0
5  2016-07-18          0
6  2016-07-15          0
7  2016-07-14          0
8  2016-07-13          0
9  2016-07-12          0
10 2016-07-11          0
11 2016-07-08          0
12 2016-07-07          0
13 2016-07-06          1
14 2016-07-05          0
15 2016-07-01          2
16 2016-06-30          1
17 2016-06-29          1
18 2016-06-28          0
19 2016-06-27          0
20 2016-06-24          1
21 2016-06-23          0
22 2016-06-22          0
23 2016-06-21          0
24 2016-06-20          0
25 2016-06-17          0
26 2016-06-16          0
27 2016-06-15          0
28 2016-06-14          1
29 2016-06-13          0
30 2016-06-10          0
31 2016-06-09          0
32 2016-06-08          0
33 2016-06-07          0
34 2016-06-06          0
35 2016-06-01          0
36 2016-05-29          0
37 2016-05-27          0
38 2016-05-26          0
39 2016-05-25          0
40 2016-05-24          0
41 2016-05-23          0
42 2016-05-20          0

Caveat: this requires that all columns other than Date in dtetags.df be numeric. If they are not, convert them before applying this code. This can be done using the answer found here
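One way to do that conversion without losing the data frame structure or the column names is to assign the result of lapply back into the selected columns. This is only a sketch: the `toy` data frame and the `Science` column are hypothetical, and it assumes every non-Date column holds values that parse as numbers.

```r
# Hypothetical stand-in: tag columns stored as factor or character.
toy <- data.frame(
  Date = c("2016-07-01", "2016-07-01", "2016-06-30"),
  Technology = factor(c("1", "1", "0")),
  Science = c("0", "1", "1"),
  stringsAsFactors = FALSE
)

# Every column except Date is treated as a tag column.
tag_cols <- setdiff(names(toy), "Date")

# as.character() first, so a factor yields its labels rather than
# its internal integer codes; assigning into toy[tag_cols] keeps
# the data frame and its column names intact.
toy[tag_cols] <- lapply(toy[tag_cols], function(x) as.numeric(as.character(x)))

str(toy)
```

Unlike apply(), which coerces the whole data frame to a matrix first, lapply() works column by column, so Date never has to be removed and reattached.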

Hope this helps.

aichao