1

I'm trying to aggregate revenue data by day. The column contains both NAs and actual revenue data intermingled. Revenue is currently a 'character.' The challenge is that you cannot sum 'characters' and any attempt to change the class breaks.

Changing class via as.numeric yields this error: "NAs introduced by coercion"

or as.numeric(as.character()) completely wipes the data in that column and returns all NAs.

Date column is in date format. Here's the aggregation I'm trying to execute.

df_agg<-aggregate(df$revenue,by=list(df$Date),sum,na.rm=TRUE)

Executing the above code of course results in: "Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : ‘sum’ not meaningful for factors"

The data is originally read via CSV. I've tried reading both with/without stringsAsFactors=FALSE

Example code
Code    Date    Visits  Revenue
1A  1/1/2018    3   5
2A  1/1/2018    5   NA
3A  1/2/2018    8   7
4A  1/3/2018    6   2
5A  1/3/2018    8   5
Brett
  • 25
  • 4
  • Without your data, this is pretty hard to diagnose. Please include in your question the first few lines of your data file and also the code that you used to read it into R. It would probably help if you read [mcve]. – G5W Jun 22 '18 at 16:44
  • @G5W example added – Brett Jun 22 '18 at 17:03
  • Two things. 1. I notice that your sample data does not contain any NAs, and so it does not support testing solutions to your problem 2. Your `aggregate` statement says `na.rn=TRUE` But surely that should be `na.rm=TRUE` – G5W Jun 22 '18 at 17:09
  • Yikes... typos. The limitations of not uploading a 300000 line dataframe – Brett Jun 22 '18 at 17:19
  • Does the original data file look exactly like the example code you included? If not, please post a subset that does... and that reproduces the problem when you try to read it into R. Also include the call you used to read it in – De Novo Jun 22 '18 at 17:31
  • Once the data is read in as factor, see [this answer](https://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-integer-numeric-without-loss-of-information) to convert it to numeric. If you want help reading the data, you'll need to show us what the data looks like before you read it in. – De Novo Jun 22 '18 at 17:36

2 Answers2

1

I suspect the root of your problem is how you're reading the data into R. There are a number of reasons a particular call to a read function will end up giving you a character column of numbers. For read.table or read.csv, if there is anything in 300000 entries in the revenue column that is not something scan() interprets as a real the column won't convert to numeric. This can be a typo, something other than NA to indicate missing values, among other things. You can get an informative error message if you force the read function to attempt to convert your column to numeric, by including a colClasses argument. Here's an example:

df <- read.table(text = "Code    Date    Visits  Revenue
1A  1/1/2018    3   5
                 2A  1/1/2018    5   NA
                 3A  1/2/2018    8   7
                 4A  1/3/2018    6   2
                 5A  1/3/2018    8   5
                 6A  1/3/2018    7   2w", header = TRUE,
                 colClasses = c("character", "Date", "numeric", "numeric"), 
                 stringsAsFactors = FALSE)

This produces the error message:

`Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : scan() expected 'a real', got '2w'``

Here, you know exactly what the problem is. It can't convert the column to numeric because of, presumably, a fat finger typo. Change it in your source file, or alternatively, add to a character vector you pass as na.string.

df <- read.table(text = "Code    Date    Visits  Revenue
1A  1/1/2018    3   5
                 2A  1/1/2018    5   NA
                 3A  1/2/2018    8   7
                 4A  1/3/2018    6   2
                 5A  1/3/2018    8   5
                 6A  1/3/2018    7   2w", header = TRUE,
                 colClasses = c("character", "Date", "numeric", "numeric"), 
                 stringsAsFactors = FALSE,
                 na.string = c("NA", "2w"))
df
#   Code       Date Visits Revenue
# 1   1A 0001-01-20      3       5
# 2   2A 0001-01-20      5      NA
# 3   3A 0001-02-20      8       7
# 4   4A 0001-03-20      6       2
# 5   5A 0001-03-20      8       5
# 6   6A 0001-03-20      7      NA

Especially when I have a very large file to read in, I always pass a colClasses argument. Not only does it help you diagnose and solve these kind of problems, it also substantially speeds up reading in the file.

De Novo
  • 7,120
  • 1
  • 23
  • 39
0

It may be related to how you read the data into R, but (after fixing another typo Revenue != revenue) your code worked for me.

df = read.table(text="Code    Date    Visits  Revenue
1A  1/1/2018    3   5
2A  1/1/2018    5   NA
3A  1/2/2018    8   7
4A  1/3/2018    6   2
5A  1/3/2018    8   5", 
header=TRUE)

df_agg <- aggregate(df$Revenue,by=list(df$Date),sum,na.rm=TRUE)
   Group.1 x
1 1/1/2018 5
2 1/2/2018 7
3 1/3/2018 7
G5W
  • 36,531
  • 10
  • 47
  • 80
  • I think (but there's no way of knowing for sure), that the problem isn't at this level. the OP indicated the column was character when read in. – De Novo Jun 22 '18 at 17:32
  • Yeah, I don't think that's what his .csv file looks like, and he couldn't use that particular call to `read.table`, even if it did. Props for offering a solution, I'm just commenting re: the lack of clarity in the question. – De Novo Jun 22 '18 at 17:37
  • Either it will answer his question or provoke him to see how his question doesn't exactly describe the problem. :-) – G5W Jun 22 '18 at 17:42