Aggregating and counting data in R

Question

I have a huge data frame in R some of my columns are

STATE FIPS 

COUNTY FIPS 

EVENT TYPE

EPISODE

I want to aggregate the data as follows

first split it up by state, and then by county within each state, then by event type within each county, and then finally remove all the duplicate "episodes" within each event type. Finally, I want to count the value of each and have that be the output.

I would like the final data frame to be a list of

Every county (with its respective state), the event type, and the count of episode numbers for each event.

For example.

Maryland    Montgomery  Tornado 7

Maryland    Howard      Tornado    19

New York    Ontario    Hurricane 7

and so on so that each county and each event has its own row.

Happy to post dummy data but I am not sure how to do that... thanks!

structure(list(EPISODE_ID = c(1099647L, 1104972L, 1104973L, 1104971L, 1105463L, 1099373L, 1104298L, 1105465L, 1102576L, 1102576L, 1102576L, 1102581L, 1101019L, 1099291L, 1099292L, 1105466L, 1105467L), EVENT_ID = c(5176080, 5166972, 5166973, 5166971, 5167810, 5163387, 5161925, 5168217, 5171130, 5171131, 5171132, 5171137, 5173230, 5164469, 5164470, 5168218, 5168219), STATE = c("ALASKA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALASKA", "ALASKA", "ALASKA", "ALABAMA", "ALABAMA"), STATE_FIPS = c(2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1), YEAR = c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000), MONTH_NAME = c("August", "December", "December", "December", "December", "July", "November", "December", "September", "September", "September", "September", "September", "July", "July", "December", "December"), EVENT_TYPE = c("Flash Flood", "Tornado", "Tornado", "Tornado", "Tornado", "Flash Flood", "Tornado", "Tornado", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Tornado", "Tornado"), CZ_TYPE = c("C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C" ), CZ_FIPS = c(14, 115, 55, 125, 61, 73, 65, 69, 1, 101, 51, 111, 14, 7, 5, 45, 67)), row.names = c(NA, -17L), class = c("data.table", "data.frame"))

This display :

   EPISODE_ID EVENT_ID   STATE STATE_FIPS YEAR MONTH_NAME  EVENT_TYPE CZ_TYPE
1     1099647  5176080  ALASKA          2 2000     August Flash Flood       C
2     1104972  5166972 ALABAMA          1 2000   December     Tornado       C
3     1104973  5166973 ALABAMA          1 2000   December     Tornado       C
4     1104971  5166971 ALABAMA          1 2000   December     Tornado       C
5     1105463  5167810 ALABAMA          1 2000   December     Tornado       C
6     1099373  5163387 ALABAMA          1 2000       July Flash Flood       C
7     1104298  5161925 ALABAMA          1 2000   November     Tornado       C
8     1105465  5168217 ALABAMA          1 2000   December     Tornado       C
9     1102576  5171130 ALABAMA          1 2000  September Flash Flood       C
10    1102576  5171131 ALABAMA          1 2000  September Flash Flood       C
11    1102576  5171132 ALABAMA          1 2000  September Flash Flood       C
12    1102581  5171137 ALABAMA          1 2000  September Flash Flood       C
13    1101019  5173230  ALASKA          2 2000  September Flash Flood       C
14    1099291  5164469  ALASKA          2 2000       July Flash Flood       C
15    1099292  5164470  ALASKA          2 2000       July Flash Flood       C
16    1105466  5168218 ALABAMA          1 2000   December     Tornado       C
17    1105467  5168219 ALABAMA          1 2000   December     Tornado       C
   CZ_FIPS
1       14
2      115
3       55
4      125
5       61
6       73
7       65
8       69
9        1
10     101
11      51
12     111
13      14
14       7
15       5
16      45
17      67

Can you share reproducible example using R's dput() function? — Karthik S, Oct 19 '20 at 17:21
See [here for a FAQ on making reproducible examples in R](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). `dput()` is the friendliest way, as it is copy/pasteable and retains all class and structure information. `dput(your_data[1:10, 1:8])` would give a copy/pasteable version of the first 10 rows and first 8 columns from your data. We just need a little sample to illustrate the problem. — Gregor Thomas, Oct 19 '20 at 17:35
Alternately, if you share R code to simulate a little bit of sample data, that works well too. — Gregor Thomas, Oct 19 '20 at 17:35

score 0 · Answer 1 · edited Oct 20 '20 at 02:18

0

I think you should give your try as well as an reproducible example with the question. Anyway, I think you could use the following to split data:

X<-split(iris, iris[,5])
print(X)

Here I tried to split data based on the iris-label axis.

For counting data, you could have a look at: count number of rows in a data frame in R based on group

edited Oct 20 '20 at 02:18

rmlockerd

3,776
2
15
25

answered Oct 19 '20 at 17:27

Tou Mou

1,270
5
16

1

Thanks! Sorry I am new to this platform and am not sure how to share reproducible data yet. – user4999605 Oct 19 '20 at 17:31
dput(your_dataframe_name) – Karthik S Oct 19 '20 at 17:32
@KarthikS I believe I added it - I took a very small subset of the data (actual data is over 1 mil rows long) – user4999605 Oct 19 '20 at 17:41
@user4999605 ,have a look at answers i posted ! – Tou Mou Oct 19 '20 at 17:44
@user4999605 , reproducible data means that you should give some data ( matrix , dataframe , etc ) and the wanted output to be able to reproduce suggested solution. – Tou Mou Oct 19 '20 at 17:50
1

@TouMou i added some to my question. I accidentally added it in the wrong place at first oops... – user4999605 Oct 19 '20 at 17:52
@user4999605 , ok . – Tou Mou Oct 19 '20 at 17:56

score 0 · Answer 2 · answered Oct 19 '20 at 17:37

0

To count data , you could think of a solution similar to the following :

X<-split(iris, iris[,5])
mydf=X[[1]]
mydf
aggregate(cbind(count = Petal.Length) ~ Petal.Length + Petal.Width + Sepal.Length, 
          data = mydf, 
          FUN = function(x){NROW(x)})

Another option with dplyr :

X<-split(iris, iris[,5])
mydf=X[[1]]
mydf
library(dplyr)
mydf %>%
    group_by(Sepal.Width,Sepal.Length) %>%
    summarise(count=n())

answered Oct 19 '20 at 17:37

Tou Mou

1,270
5
16

In the secod option , i aggregated data by (Sepal.Width,Sepal.Length) axis then i summarized results by counting possible combinations. – Tou Mou Oct 19 '20 at 17:48
1

This second option is really good, but actually, I dont want to count the number of each episode_ID, what I want to do is after it is all groups, remove any duplicate episode IDs (it doesnt matter which stay or not) and then count the number of UNIQUE episodes per county per state per event. – user4999605 Oct 19 '20 at 18:17
I had given another solution with the data you had given. I wish the problem is solved. – Tou Mou Oct 19 '20 at 18:19
To remove duplicate , use distinct() function of dplyr package. – Tou Mou Oct 19 '20 at 18:22
I think i replied to your question. If problem resolved , plz accept answer! – Tou Mou Oct 19 '20 at 18:33

Tou Mou · Answer 3 · 2020-10-19T19:18:45.793

As the data is based on list structure , i suffered to get the following :

df=data.frame(EPISODE_ID = c(1099647L, 1104972L, 1104973L, 1104971L, 1105463L, 1099373L, 1104298L, 1105465L, 1102576L, 1102576L, 1102576L, 1102581L, 1101019L, 1099291L, 1099292L, 1105466L, 1105467L), EVENT_ID = c(5176080, 5166972, 5166973, 5166971, 5167810, 5163387, 5161925, 5168217, 5171130, 5171131, 5171132, 5171137, 5173230, 5164469, 5164470, 5168218, 5168219), STATE = c("ALASKA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALASKA", "ALASKA", "ALASKA", "ALABAMA", "ALABAMA"), STATE_FIPS = c(2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1), YEAR = c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000), MONTH_NAME = c("August", "December", "December", "December", "December", "July", "November", "December", "September", "September", "September", "September", "September", "July", "July", "December", "December"), EVENT_TYPE = c("Flash Flood", "Tornado", "Tornado", "Tornado", "Tornado", "Flash Flood", "Tornado", "Tornado", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Tornado", "Tornado"), CZ_TYPE = c("C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C" ), CZ_FIPS = c(14, 115, 55, 125, 61, 73, 65, 69, 1, 101, 51, 111, 14, 7, 5, 45, 67), row.names = 1:17)

print(df)
print(typeof(df))
library(dplyr)
 
df %>%
    group_by(STATE,EVENT_TYPE) %>%
    summarise(count=n())

Output :

# Groups:   STATE [2]
  STATE   EVENT_TYPE  count
  <fct>   <fct>       <int>
1 ALABAMA Flash Flood     5
2 ALABAMA Tornado         8
3 ALASKA  Flash Flood     4

I didn't find a country axis as the question suggest . However , you could use the same logic of that answer . The df should be a dataframe.

Maybe better :

df=data.frame(EPISODE_ID = c(1099647L, 1104972L, 1104973L, 1104971L, 1105463L, 1099373L, 1104298L, 1105465L, 1102576L, 1102576L, 1102576L, 1102581L, 1101019L, 1099291L, 1099292L, 1105466L, 1105467L), EVENT_ID = c(5176080, 5166972, 5166973, 5166971, 5167810, 5163387, 5161925, 5168217, 5171130, 5171131, 5171132, 5171137, 5173230, 5164469, 5164470, 5168218, 5168219), STATE = c("ALASKA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALASKA", "ALASKA", "ALASKA", "ALABAMA", "ALABAMA"), STATE_FIPS = c(2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1), YEAR = c(2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000), MONTH_NAME = c("August", "December", "December", "December", "December", "July", "November", "December", "September", "September", "September", "September", "September", "July", "July", "December", "December"), EVENT_TYPE = c("Flash Flood", "Tornado", "Tornado", "Tornado", "Tornado", "Flash Flood", "Tornado", "Tornado", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Flash Flood", "Tornado", "Tornado"), CZ_TYPE = c("C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C" ), CZ_FIPS = c(14, 115, 55, 125, 61, 73, 65, 69, 1, 101, 51, 111, 14, 7, 5, 45, 67), row.names = 1:17)

print(df)
print(typeof(df))
library(dplyr)

splited_df=df %>%
  group_split(STATE)
length(splited_df)
splited_df
m=lapply(1:length(splited_df), function(i) splited_df[[i]] %>%
    group_by(STATE,EVENT_TYPE) %>%
    summarise(count=n()))
m

Output :

[[1]]
# A tibble: 2 x 3
# Groups:   STATE [1]
  STATE   EVENT_TYPE  count
  <fct>   <fct>       <int>
1 ALABAMA Flash Flood     5
2 ALABAMA Tornado         8

[[2]]
# A tibble: 1 x 3
# Groups:   STATE [1]
  STATE  EVENT_TYPE  count
  <fct>  <fct>       <int>
1 ALASKA Flash Flood     4

To convert from list to dataframe :

Assuming your list of lists is called l:

df <- data.frame(matrix(unlist(l), nrow=length(l), byrow=T))

Aggregating and counting data in R

3 Answers3