How do I sum a column based on another column?

Question

Assuming that the dataframe is stored as fruit, and is in the following format:

State           Fruit Category         Fruit Type         Gross Value
ACT             CitrusFruit            Mandarins          $4,500,000
ACT             CitrusFruit            Oranges            
NSW             PomeFruit              Apple              $139,130,203.50
NSW             Grapes                 Wine Production    $50,000,000
NSW             OrchardStoneFruit      Avocados           $10,031,123
QLD             CitrusFruit            Oranges

How would I sum the gross value, based on the State - while excluding blank values. But at the same time, the gross value of each state should be summed, rather than displayed separately for CitrusFruit, PomeFruit, etc.

I have tried to use the

library(plyr)
counts

method to no avail.

Any help would be greatly appreciated.

EDIT: I have tried to use the following method:

library(dplyr)
fruit %>% 
  group_by(State) %>% 
  summarise(Gross = sum(Gross))

However, I am getting an error that says:

Evaluation Error: 'sum' not meaningful for factors.

EDIT: Output from dput(fruit)

structure(list(State = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 
3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 
8L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 2L, 2L, 2L, 3L, 
3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 
8L), .Label = c("ACT", "NSW", "NT", "QLD", "SA", "TAS", "VIC", 
"WA"), class = "factor"), Fruit.Category = structure(c(6L, 6L, 
6L, 8L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 
4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L), .Label = c("  Grapes  ", "  OrchardStoneFruit  ", "  OtherFruit  ", 
"  PomeFruit  ", " CitrusFruit ", " CitrusFruit  ", " Grapes ", 
" Grapes  ", " OrchardStoneFruit ", " OtherFruit ", " PomeFruit "
), class = "factor"), Fruit.Type = structure(c(5L, 8L, 13L, 18L, 
31L, 2L, 4L, 6L, 7L, 9L, 14L, 17L, 3L, 11L, 12L, 15L, 1L, 10L, 
16L, 13L, 23L, 26L, 13L, 23L, 26L, 13L, 23L, 26L, 13L, 23L, 26L, 
13L, 23L, 26L, 13L, 23L, 26L, 13L, 23L, 26L, 18L, 31L, 18L, 31L, 
18L, 31L, 18L, 31L, 18L, 31L, 18L, 31L, 18L, 31L, 14L, 17L, 20L, 
22L, 24L, 25L, 27L, 14L, 17L, 20L, 22L, 24L, 25L, 27L, 14L, 17L, 
20L, 22L, 24L, 25L, 27L, 14L, 17L, 20L, 22L, 24L, 25L, 27L, 14L, 
17L, 20L, 22L, 24L, 25L, 27L, 14L, 17L, 20L, 22L, 24L, 25L, 27L, 
14L, 17L, 20L, 22L, 24L, 25L, 27L, 15L, 21L, 29L, 30L, 15L, 21L, 
29L, 30L, 15L, 21L, 29L, 30L, 15L, 21L, 29L, 30L, 15L, 21L, 29L, 
30L, 15L, 21L, 29L, 30L, 15L, 21L, 29L, 30L, 16L, 19L, 28L, 16L, 
19L, 28L, 16L, 19L, 28L, 16L, 19L, 28L, 16L, 19L, 28L, 16L, 19L, 
28L, 16L, 19L, 28L), .Label = c("  Apples  ", "  Avocados  ", 
"  Bananas  ", "  Cherries  ", "  Mandarins  ", "  Mangoes  ", 
"  Nectarines  ", "  Oranges  ", "  Peaches  ", "  Pears  ", 
"  Pineapples  ", "  Strawberries  ", " AllOtherCitrusFruit ", 
" AllOtherOrchardFruit ", " AllOtherOtherFruit ", " AllOtherPomeFruit ", 
" AllOtherStoneFruit ", " AllOtherUses ", " Apples ", " Avocados ", 
" Bananas ", " Cherries ", " Mandarins ", " Mangoes ", " Nectarines ", 
" Oranges ", " Peaches ", " Pears ", " Pineapples ", " Strawberries ", 
" WineProduction "), class = "factor"), Gross.Value = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 60L, 97L, 23L, 104L, 1L, 1L, 56L, 98L, 36L, 101L, 68L, 
11L, 1L, 1L, 1L, 91L, 96L, 57L, 99L, 92L, 21L, 71L, 29L, 48L, 
1L, 76L, 51L, 46L, 58L, 1L, 34L, 37L, 14L, 22L, 70L, 18L, 59L, 
28L, 32L, 41L, 83L, 61L, 69L, 30L, 1L, 1L, 26L, 1L, 1L, 25L, 
35L, 19L, 2L, 80L, 9L, 8L, 7L, 102L, 47L, 31L, 1L, 85L, 75L, 
1L, 88L, 93L, 52L, 1L, 66L, 50L, 100L, 43L, 89L, 95L, 2L, 82L, 
65L, 5L, 24L, 94L, 33L, 64L, 10L, 90L, 78L, 84L, 62L, 3L, 86L, 
20L, 73L, 1L, 38L, 67L, 72L, 15L, 63L, 1L, 1L, 39L, 17L, 1L, 
1L, 16L, 40L, 1L, 1L, 103L, 79L, 49L, 1L, 44L, 6L, 105L, 53L, 
1L, 1L, 1L, 1L, 81L, 54L, 27L, 87L, 13L, 1L, 55L, 106L, 4L, 42L, 
12L, 45L, 77L, 74L), .Label = c("", "$0.00", "$1,025,861.63", 
"$1,107,476.82", "$1,135,055.74", "$1,148,385.97", "$1,514,089.93", 
"$1,539,762.85", "$1,565,234.83", "$10,469,580.98", "$100,622,922.20", 
"$106,039,956.40", "$11,648,561.35", "$113,930,475.80", "$114,195,162.80", 
"$12,169,338.44", "$12,492,792.64", "$12,843,528.01", "$120,877,197.60", 
"$13,245.08", "$13,331,668.11", "$13,981,075.51", "$130,258,416.50", 
"$14,203,578.43", "$14,697,408.09", "$15,085,825.24", "$15,196.71", 
"$15,246,349.76", "$154,858,589.30", "$168,325.78", "$17,661,100.37", 
"$18,278,371.16", "$188,414.59", "$19,896,312.15", "$2,370,402.03", 
"$2,557,589.86", "$209,648,663.50", "$21,426,350.11", "$22,482,034.46", 
"$23,929,331.35", "$238,668.61", "$249,675,376.10", "$26,669,599.23", 
"$27,540,236.71", "$270,903.84", "$3,485,520.14", "$3,520,605.89", 
"$3,659,706.68", "$3,829,198.67", "$301,644.66", "$301,976.25", 
"$31,133,715.88", "$313,144.86", "$334,363.30", "$35,212,772.81", 
"$37,927,507.70", "$38,989,343.33", "$385,858,491.60", "$4,447,813.26", 
"$4,549,208.46", "$4,569,373.00", "$4,702.20", "$4,712,329.56", 
"$4,995,833.14", "$40,133,037.39", "$40,481.05", "$435,712,531.70", 
"$44,434,103.55", "$443,017.10", "$45,665,029.35", "$45,888,545.67", 
"$46,638,011.92", "$47,589.51", "$5,793,841.42", "$5,854,982.37", 
"$51,534,636.09", "$53,367,548.56", "$53,377,925.45", "$555,799.71", 
"$57,522,144.94", "$57,930,562.37", "$58,316,912.75", "$6,170,170.78", 
"$6,791,088.95", "$6,824,520.08", "$623,030.52", "$63,493,163.21", 
"$664,237.23", "$7,066,407.60", "$7,168,380.92", "$7,364,245.36", 
"$7,426,224.28", "$7,894.54", "$70,218,810.35", "$76,591,000.57", 
"$8,596,626.45", "$8,713,417.54", "$85,876,834.41", "$873,748.40", 
"$9,262,889.69", "$9,731,658.36", "$9,991,440.81", "$91,781,453.44", 
"$92,299.72", "$95,677,012.68", "$983,780.33"), class = "factor")), class = "data.frame", row.names = c(NA, 
-152L))

Is your `$\`Gross Value\`` actually `"$4,500,000"`? If so, you need to convert to a number first. — r2evans, Oct 18 '18 at 21:33
`fruit %>% mutate(Gross = as.numeric(gsub("[^0-9.]", "", as.character(Gross)))) %>% group_by(State) %>% summarize(Gross = sum(Gross))` (I just added the decimal to the `gsub` regex ...) — r2evans, Oct 18 '18 at 21:45
generally it's helpful to use the output of `dput(someData)` to give users some sample data to address the question — zack, Oct 18 '18 at 21:52
@r2evans Thank you, but I am getting an error with that code saying: `Error in mutate_impl(.data, dots) : Column 'Gross Value' must be length 152 (the number of rows) or one, not 0 Calls` — Nauman Shahid, Oct 18 '18 at 22:00
There are two points of pasting `dput` output to the question: (1) it gives us a much better idea of what is in your data, specifically identifying factors here though that was not in doubt; and (2) it allows us to quickly copy and paste it into our console. Unfortunately, the way you posted it included comment characters in front of everything (I've edited them out), but making it easier on us will invariably result in faster/better answers. Consider reading other "good question" techniques: Refs: https://stackoverflow.com/questions/5963269 and https://stackoverflow.com/help/mcve. — r2evans, Oct 18 '18 at 22:05

score 2 · Accepted Answer · answered Oct 18 '18 at 22:09

2

A couple of problems here:

You don't have Gross Value in your data, you have Gross.Value.
That column is factor, which is a more storage-efficient form of strings. Neither factor nor character can be summed. R knows nothing about accounting so the "$" means nothing to it in that context.

Try this:

library(dplyr)
someData %>%
  mutate(Gross.Value = as.numeric(gsub("[^0-9.]", "", as.character(Gross.Value)))) %>%
  group_by(State) %>%
  summarize(Gross.Value = sum(Gross.Value, na.rm=TRUE))
# # A tibble: 8 x 2
#   State Gross.Value
#   <fct>       <dbl>
# 1 ACT            0 
# 2 NSW    564400574.
# 3 NT      20133040.
# 4 QLD   1053007677.
# 5 SA     691850721.
# 6 TAS    112902970.
# 7 VIC   1069102796.
# 8 WA     281014929.

The only changes from my comment were (1) using the correct column name, and (2) adding na.rm=TRUE, since you have many blanks. This means you need to be careful how you use this data, as you now have biases and inaccuracies in your summary.

answered Oct 18 '18 at 22:09

r2evans

141,215
6
77
149

Thank you so much! I don't quite understand why it's `Gross.Value` rather than `Gross Value`, as in the CSV file it's listed under `Gross Value`. Also, would it be possible to use `%>% arrange(desc(n))` to display it in descending order? – Nauman Shahid Oct 18 '18 at 22:16
R does not like spaces in column names, so the default behavior of `read.csv` (and similar functions) is to clean them up, ala `make.names(c("quux", "foo bar", "quux"), unique=TRUE)`. You can force it with `read.csv(...,check.names=FALSE)`, and then you'll have to backtick the column names, ala `someData$\`Gross Value\``, which is fine if you want/need them to remain like that but can be tiresome to the pinky-finger. An alternative is to accept the changed spelling and change it back when you render into a report somewhere. – r2evans Oct 18 '18 at 22:23
Thank you so much! This is extremely helpful! – Nauman Shahid Oct 18 '18 at 22:30

score 0 · Answer 2 · answered Oct 18 '18 at 21:47

0

You should convert the factor to numeric and then sum. Here is the solution I came up with:

library(tidyverse)

##This line converts the factor into a numeric variable, by making it a character and then removing the commas and the dollar sign. Finally it converts to number
fruit$`Gross Value` <- as.numeric(str_replace_all(as.character(fruit$`Gross Value`),"\\$|\\,",""))

##Then you can run your sum function


fruit %>% 
  group_by(State) %>% 
  summarise(Gross = sum(`Gross Value`, na.rm = TRUE))

answered Oct 18 '18 at 21:47

Henry Cyranka

2,970
1
16
21

Thank you, I tried that but I am getting a strange error that I do not understand: `Error in $<-.data.frame(*tmp*, "Gross Value", value = numeric(0)) : replacement has 0 rows, data has 152 Calls` – Nauman Shahid Oct 18 '18 at 21:52

How do I sum a column based on another column?

2 Answers2