I would like to sum a single column of data that was output from an sqldf function in R.

I have a CSV file that contains groupings of sites, each with a unique ID and their associated areas. For example:

occurrenceID                              sarea
{0255531B-904F-4E2D-B81D-797A21165A2F}  0.30626786
{0255531B-904F-4E2D-B81D-797A21165A2F}  0.49235953
{0255531B-904F-4E2D-B81D-797A21165A2F}  0.03490536
{0255531B-904F-4E2D-B81D-797A21165A2F}  0.00001389
{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}  0.0302389
{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}  0.01360811
{1EC60400-0AD0-4DB5-B815-221C4123AE7F}  0.08412911
{1EC60400-0AD0-4DB5-B815-221C4123AE7F}  0.01852466

I used the code below in R to pull out the largest area from each grouping of unique IDs.

library(sqldf)
MyData <- read.csv(file = "sacandaga2.csv", header = TRUE, sep = ",")
sqldf("select max(sarea), occurrenceID from MyData group by occurrenceID")

This produced the following output:

 max(sarea)                           occurrenceID
1  0.49235953 {0255531B-904F-4E2D-B81D-797A21165A2F}
2  0.03023890 {175A4B1C-CA8C-49F6-9CD6-CED9187579DC}
3  0.08412911 {1EC60400-0AD0-4DB5-B815-221C4123AE7F}
4  0.00548259 {2412E244-2E9A-4477-ACC6-1EB02503BE75}
5  0.00295924 {40450574-ABEB-48E3-9BE5-09B5AB65B465}
6  0.01403846 {473FB631-D398-46B7-8E85-E63540BDFF92}
7  0.00257519 {4BABDE22-E8E0-435E-B60D-0BB9A84E1489}
8  0.02158115 {5F616A33-B028-46B1-AD92-89EAC1660C41}
9  0.00191211 {70067496-25B6-4337-8C70-782143909EF9}
10 0.03049355 {7F858EBB-132E-483F-BA36-80CE889373F5}
11 0.03947298 {9A579565-57EC-4E46-95ED-79724FA6F2AB}
12 0.02464722 {A9010BA3-0FE1-40B1-96A7-21122261A003}
13 0.00136672 {AAD710BF-1539-4235-87F1-34B66CF90781}
14 0.01139146 {AB1286C3-DBE3-467B-99E1-AEEF88A1B5B2}
15 0.07954269 {BED0433A-7167-4184-A25F-B9DBD358AFFB}
16 0.08401067 {C4EF0F45-5BF7-4F7C-BED8-D6B2DB718CB2}
17 0.04289261 {C58AC2C6-BDBE-4FE5-BD51-D70BBDFB4DB5}
18 0.03151558 {D4230F9C-80E4-454A-9D5D-0E373C6DCD9A}
19 0.00403585 {DD76A03A-CFBF-41E9-A571-03DA707BEBDA}
20 0.00007336 {E20DE254-8A0F-40BE-90D2-D6B71880E2A8}
21 9.81847859 {F382D5A6-F385-426B-A543-F5DE13F94564}
22 0.00815881 {F9032905-074A-468F-B60E-26371CF480BB}
23 0.24717113 {F9E5DC3C-4602-4C80-B00B-2AF1D605A265}

Now I would like to sum all the values in the max(sarea) column. What is the best way to accomplish this?

Zsimek
  • I'm not very familiar with `sqldf`, but is this data frame some kind of special `sqldf` object or do you want to do the summing using `sqldf`? Otherwise, can't you just use `sum`? – divibisan Apr 02 '19 at 14:38

4 Answers

You can do the sum in either sqldf or R. To do it in R, assign your existing result and sum the column:

# assign your original query result
grouped_sum = sqldf("select max(sarea),occurrenceID from MyData group by occurrenceID")
# and sum in R
sum(grouped_sum$`max(sarea)`)

# you might prefer to use a standard column name so you don't need backticks
grouped_sum = sqldf(
  "select max(sarea) as max_sarea, occurrenceID
   from MyData 
   group by occurrenceID"
)
sum(grouped_sum$max_sarea)
Gregor Thomas

If the intention is to do this in a single `sqldf` call, use a `with` clause (a common table expression):

library(sqldf)
sqldf("with tmpdat as (
         select max(sarea) as mxarea, occurrenceID
         from MyData
         group by occurrenceID
       )
       select sum(mxarea) as smxarea from tmpdat")
#   smxarea
#1 0.6067275

data

MyData <- 
structure(list(occurrenceID = c("{0255531B-904F-4E2D-B81D-797A21165A2F}", 
"{0255531B-904F-4E2D-B81D-797A21165A2F}", "{0255531B-904F-4E2D-B81D-797A21165A2F}", 
"{0255531B-904F-4E2D-B81D-797A21165A2F}", "{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}", 
"{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}", "{1EC60400-0AD0-4DB5-B815-221C4123AE7F}", 
"{1EC60400-0AD0-4DB5-B815-221C4123AE7F}"), sarea = c(0.30626786, 
0.49235953, 0.03490536, 1.389e-05, 0.0302389, 0.01360811, 0.08412911, 
0.01852466)), class = "data.frame", row.names = c(NA, -8L))
akrun
  • Question: is there a benefit to using `with` and a temporary table instead of applying `from` directly to the initial output? – M-- Apr 02 '19 at 15:16
  • @M-M It is just one way. For group by and other steps, I would use tidyverse or data.table instead of the sqldf way – akrun Apr 02 '19 at 15:18
  • Thanks. I am using `sql` in another project outside of R and was asking about that matter. I am sure your `data.table` solution would be much more efficient :D – M-- Apr 02 '19 at 15:21
  • @M-M It makes sense to use `sql` in some cases. But, for this example, not so sure – akrun Apr 02 '19 at 15:22
  • @M-M `with` (common table expression) is often more readable than subqueries, especially if you need to go more than one level deep or need to reference the intermediate result more than once. – Gregor Thomas Apr 02 '19 at 16:53
  • [Generally doesn't matter](https://stackoverflow.com/q/11169550/903061). – Gregor Thomas Apr 02 '19 at 17:29
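
To illustrate the point about referencing an intermediate result more than once, here is a minimal sketch, assuming the `sqldf` package is installed. The short IDs, the `share` column and the cut-down data are hypothetical and only serve the example. The CTE `tmpdat` is defined once and used twice: as the main table and inside a scalar subquery that computes the total.

```r
library(sqldf)

# Cut-down, hypothetical stand-in for the question's data (short IDs for brevity)
MyData <- data.frame(
  occurrenceID = c("A", "A", "B", "B", "C"),
  sarea = c(0.30626786, 0.49235953, 0.0302389, 0.01360811, 0.08412911)
)

# tmpdat holds one row per occurrenceID with its largest area;
# it is referenced twice below without repeating the group-by query
shares <- sqldf("with tmpdat as (
                   select occurrenceID, max(sarea) as mxarea
                   from MyData
                   group by occurrenceID
                 )
                 select occurrenceID,
                        mxarea,
                        mxarea / (select sum(mxarea) from tmpdat) as share
                 from tmpdat")
print(shares)
```

With a nested subquery instead of `with`, the grouped query would have to be written out in both places.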

You can sum the per-group maxima with a nested query:

sqldf("select sum(max_sarea) as sum_of_max_sarea
       from (select max(sarea) as max_sarea, occurrenceID
             from Mydata
             group by occurrenceID)")

#   sum_of_max_sarea
# 1        0.6067275

Data:

Mydata <- structure(list(occurrenceID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), 
.Label = c("0255531B-904F-4E2D-B81D-797A21165A2F", "175A4B1C-CA8C-49F6-9CD6-CED9187579DC", 
           "1EC60400-0AD0-4DB5-B815-221C4123AE7F"), class = "factor"), 
sarea = c(0.30626786, 0.49235953, 0.03490536, 1.389e-05, 0.0302389, 
          0.01360811, 0.08412911, 0.01852466)), class = "data.frame", 
row.names = c(NA, -8L))
M--
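
The nested query can be cross-checked without any SQL. This is a minimal base R sketch, assuming the same eight sample rows as in the question; `group_max` and `total` are names chosen here for illustration.

```r
# Same eight rows as the sample data in the question
Mydata <- data.frame(
  occurrenceID = rep(c("{0255531B-904F-4E2D-B81D-797A21165A2F}",
                       "{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}",
                       "{1EC60400-0AD0-4DB5-B815-221C4123AE7F}"),
                     times = c(4, 2, 2)),
  sarea = c(0.30626786, 0.49235953, 0.03490536, 0.00001389,
            0.0302389, 0.01360811, 0.08412911, 0.01852466)
)

# Largest area within each occurrenceID (a named numeric vector)
group_max <- tapply(Mydata$sarea, Mydata$occurrenceID, max)

# Sum of the per-group maxima
total <- sum(group_max)
print(total)
```

On the sample rows this should agree with the 0.6067275 reported by the SQL versions.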

If DF is the last data frame shown in the question, this sums its numeric column:

sqldf("select sum([max(sarea)]) as sum from DF")
##        sum
## 1 11.07853

Note

We assume the data frame is as follows, shown in reproducible form:

Lines <- "max(sarea)                           occurrenceID
1  0.49235953 {0255531B-904F-4E2D-B81D-797A21165A2F}
2  0.03023890 {175A4B1C-CA8C-49F6-9CD6-CED9187579DC}
3  0.08412911 {1EC60400-0AD0-4DB5-B815-221C4123AE7F}
4  0.00548259 {2412E244-2E9A-4477-ACC6-1EB02503BE75}
5  0.00295924 {40450574-ABEB-48E3-9BE5-09B5AB65B465}
6  0.01403846 {473FB631-D398-46B7-8E85-E63540BDFF92}
7  0.00257519 {4BABDE22-E8E0-435E-B60D-0BB9A84E1489}
8  0.02158115 {5F616A33-B028-46B1-AD92-89EAC1660C41}
9  0.00191211 {70067496-25B6-4337-8C70-782143909EF9}
10 0.03049355 {7F858EBB-132E-483F-BA36-80CE889373F5}
11 0.03947298 {9A579565-57EC-4E46-95ED-79724FA6F2AB}
12 0.02464722 {A9010BA3-0FE1-40B1-96A7-21122261A003}
13 0.00136672 {AAD710BF-1539-4235-87F1-34B66CF90781}
14 0.01139146 {AB1286C3-DBE3-467B-99E1-AEEF88A1B5B2}
15 0.07954269 {BED0433A-7167-4184-A25F-B9DBD358AFFB}
16 0.08401067 {C4EF0F45-5BF7-4F7C-BED8-D6B2DB718CB2}
17 0.04289261 {C58AC2C6-BDBE-4FE5-BD51-D70BBDFB4DB5}
18 0.03151558 {D4230F9C-80E4-454A-9D5D-0E373C6DCD9A}
19 0.00403585 {DD76A03A-CFBF-41E9-A571-03DA707BEBDA}
20 0.00007336 {E20DE254-8A0F-40BE-90D2-D6B71880E2A8}
21 9.81847859 {F382D5A6-F385-426B-A543-F5DE13F94564}
22 0.00815881 {F9032905-074A-468F-B60E-26371CF480BB}
23 0.24717113 {F9E5DC3C-4602-4C80-B00B-2AF1D605A265}"
DF <- read.table(text = Lines, check.names = FALSE)
G. Grothendieck
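
The square brackets in `[max(sarea)]` quote a column name that is not a plain identifier. SQLite, the default sqldf backend, also accepts double quotes for this. A minimal sketch, assuming `sqldf` is installed and using a three-row stand-in for `DF` (the alias `total` and the variable names are illustrative):

```r
library(sqldf)

# Three-row stand-in for DF; check.names = FALSE keeps the awkward column name
DF <- data.frame(
  `max(sarea)` = c(0.49235953, 0.0302389, 0.08412911),
  occurrenceID = c("{0255531B-904F-4E2D-B81D-797A21165A2F}",
                   "{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}",
                   "{1EC60400-0AD0-4DB5-B815-221C4123AE7F}"),
  check.names = FALSE
)

# Two ways of quoting the column name inside the SQL statement
brackets  <- sqldf("select sum([max(sarea)]) as total from DF")
dblquotes <- sqldf('select sum("max(sarea)") as total from DF')

# Plain R gives the same number without any SQL
in_r <- sum(DF[["max(sarea)"]])
```

Renaming the column with `as` in the original query, as suggested in another answer, avoids the quoting altogether.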