0

I am new to R, and I am trying to figure out how to combine data into a unique format, and then input that data into a postgres database. Here is code to provide an example of the format:

df <- data.frame("1"=c("us","fr","us","fr","de", "de"), "2"=c(35, 20, 35, 20, 22.25, 125), "3"=c(105, -67.56, 105, -67.56, 138, 12), "4"=c(100, 200, 80, 160, 50, 18))

colnames(df) <-  c(
    "countryAbr",
    "latitude",
    "longitude",
    "countOfResidents"
    )

Here is the current data set:

  countryAbr latitude longitude countOfResidents
1         us    35.00    105.00              100
2         fr    20.00    -67.56              200
3         us    35.00    105.00               80
4         fr    20.00    -67.56              160
5         de    22.25    138.00               50
6         de    125      12                   18

I want to combine the data, based on total occurrences of a country, taking into account both unique and duplicate latitude and longitude coordinates. I also want to take the sum of residents and combine it. Here is my expected final result:

  countryAbr TotalCountryOccurances TotalResidentCount
1         us                      2                180
2         fr                      2                360
3         de                      2                 68

I used the count function to get the total occurrences of the country (I think??), but not sure how to combine everything and what functions to use.

countryCount <- count(df[,c("latitude", "longitude")])

When I do have the final data set, I would like to put it into a postgres table for querying and use on the front end. I know how to do the latter, not sure how to get R data into Postgres table though.

** Edit for clarity regarding duplicate and unique lat and long**

unseen_damage
  • 1,346
  • 1
  • 14
  • 32
  • Suggested dupe: [how to sum data by group in R](http://stackoverflow.com/q/1660124/903061) – Gregor Thomas Sep 18 '16 at 01:18
  • using `dplyr`, ```df %>% group_by(countryAbr) %>% summarize(TotalCountryOccurances = n(), TotalResidentCount = sum(countOfResidents))``` gives your needed result. – yeedle Sep 18 '16 at 20:46

2 Answers2

2

If you want to count the number of rows and sum the number of residents for each unique countryAbr, you can use dplyr to summarise these counts after grouping by countryAbr:

library(dplyr)
result <- df %>% group_by(countryAbr) %>% 
                 summarise(TotalCountryOccurances=n(), TotalResidentCount=sum(countOfResidents))

Since we are grouped by each unique value in countryAbr, the function n() returns the number of rows for the group and the function sum computes the sum over the column countOfResidents for the group. In this case, the counting of rows and the sum is over all values of latitude and longitude for each unique value in countryAbr. I'm assuming that is what you mean by:

taking into account both unique and duplicate latitude and longitude coordinates.

With the data you provided:

print(result)
### A tibble: 3 x 3
##  countryAbr TotalCountryOccurances TotalResidentCount
##      <fctr>                  <int>              <dbl>
##1         de                      1                 50
##2         fr                      2                360
##3         us                      2                180
aichao
  • 7,375
  • 3
  • 16
  • 18
1

You can use the data.table package

library(data.table)
setDT(df)
df[, .(TotalCountryOccurances = .N, TotalResidentCount = sum(countOfResidents)), 
   by = countryAbr]
dww
  • 30,425
  • 5
  • 68
  • 111
  • the data.table package has numerous speed and syntax advantages. When learning R, you should just start off storing tables as data.tables instead of data frames or matricies. – hedgedandlevered Sep 18 '16 at 04:18