Count if >0 into frequency distribution in R

Question

Say I have 900 dataframes at hand, and I wanted to get something similar to a frequency distribution based off of another column for each "type".

Sample Code makin;

df1 <- as_tibble(iris)
df2 <- slice(df1, 1:7) 
df2 <- df2 %>% 
  mutate(type = 1:7)

This is similar to what I currently have just working with one dataframe:

df2 %>% select(type, Sepal.Length) %>%
  mutate(Count = ifelse(Sepal.Length > 0, 1, 0)) %>%
  mutate(Percentage = Count/7)

In the case that for any row, Sepal.Length = 0, then I'm not going to count it (count column will be = 0 for that row value).

But I'm going to have 900 dataframes that I'll be running this code on, so I was thinking about running it through a loop.

Ideally, if two dataframes are inputted, and both have Sepal.Length values >0 for row 1, then I want the count to be 2 for row 1 / type 1. Is there a better way to approach this? And if I do go for the looping option then is there a way to combine all the dataframes to tell R that row 1 / type 1 has multiple > 0 values?

can you clarify which frequency you're looking for? The frequency of Sepal.Length > 1 grouped by type? — De Novo, Mar 06 '18 at 03:53
Frequency of `sepal.length > 0` or better yet `sepal.legnth != 0`, so if `sepal.length = 1` then the count would be = 1, if `sepal.length = 4.5` then count = 1 again, anything not 0 would count as 1 and so on. Ideally, when I put the dataframes together (say 3 dataframes for example) I need R to understand that Sepal.Length could become `c("4.1","2.1","0")` for type 1 in which case the total count would then be 2 where 1+1+0 for type 1 — S31, Mar 06 '18 at 04:01
Yes, sorry, accidentally typed 1 instead of 0. So you're not going to have a Sepal.length < 0, but for your data the variable of interest does have negative values. You just want `variable1 != 0` grouped by `variable2` across multiple data frames. I don't think you need to do any fancy looping to get this. Let me write something for you... — De Novo, Mar 06 '18 at 04:08

De Novo · Answer 1 · 2018-03-06T05:13:59.657

0

For your iris example, what it sounds like you want is:

library(tidyverse)
df1 <- as_tibble(iris)
df2 <- slice(df1, 1:7) 
df2 <- df2 %>% 
  mutate(type = 1:7)

group_by(df2, type) %>%
  transmute(has_sepal = sum(Sepal.Length > 0))
#   A tibble: 7 x 2
#   Groups:   type [7]
#    type has_sepal
#   <int>     <int>
# 1     1         1
# 2     2         1
# 3     3         1
# 4     4         1
# 5     5         1
# 6     6         1
# 7     7         1

To do this over 900 data frames... If you want this to work on iris, hard code. Someone who is familiar with writing functions using tidyverse evaluation could write a more general version for you, but that's still on my todo list.

f_fill_in_blank_first <- function(tib){
# hard code the var1 and var2
    group_by(tib, <var1>) %>%
      transmute(var1_not_zero = sum(<var 1> != 0))
}

f_iris <- function(tib)
    group_by(tib, type) %>%
        transmute(var1_not_zero = sum(Sepal.Length != 0)
}

Depending on the structure of your 900 data frames, you could vapply with this function (edit, no, not this function, refactor so it produces a named atomic vector if you want to vapply this function) to put the whole thing into an array, then collapse one of the dimensions with apply and sum

edited Mar 06 '18 at 05:13

answered Mar 06 '18 at 04:10

De Novo

7,120
1
23
39

Awesome - thanks for the quick response. The only thing is when I use the above example just for one, so I'll plug in `f(df2, Sepal.Length, type)` - I'll get this error `Error in grouped_df_impl(data, unname(vars), drop) : Column var2 is unknown` – S31 Mar 06 '18 at 04:28
whoops, yeah I forgot about the nonstandard tidy evaluation. Wrong environment. The easiest solution, since it's the same variable each time, would be to hard code it. – De Novo Mar 06 '18 at 04:37
Great! It gives me the correct output now. To follow-up though, it does say `Adding missing grouping variables: `type`` - any idea why it's not acknowledging a column that's there? – S31 Mar 06 '18 at 04:42
this uses functions in the [tidyverse](https://www.tidyverse.org/). They have a nonstandard evaluation that allows them to evaluate their arguments an an environment created by the first argument passed to them. That's why you can do `group_by(df2, type)` when in your global environment, if you try to use `type` R will tell you the object doesn't exist. It looks up type in an environment of df2. Standard evaluation would require you use `df2$type`, which is how you extract that column in the global environment. – De Novo Mar 06 '18 at 04:47
In the case of the function f, where this ends up being tricky is when you're already operating in a different environment. There is a way to write this so that the tidyverse functions use the value of a formal argument instead, and this will work, but I haven't learned how to do that yet. – De Novo Mar 06 '18 at 04:50
Got it. Makes a lot more sense. Another follow-up, having trouble with collapsing the function with an apply or sum - can you point me to an example or provide an example? Just can't seem to generate the output n time and condense into one summed table – S31 Mar 06 '18 at 04:58
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/166288/discussion-between-dan-hall-and-s31). – De Novo Mar 06 '18 at 05:02

Daniel Schütte · Answer 2 · 2018-03-06T05:28:26.993

If you want to keep your code:

df2 %>% select(type, Sepal.Length) %>%
mutate(Count = ifelse(Sepal.Length > 0, 1, 0)) %>%
mutate(Percentage = Count/7)

You can wrap it into a function (add_a_count):

library(tidyverse)
df1 <- as_tibble(iris)
df2 <- df1 %>% 
    mutate(type = nrow(df1))

add_a_count = function(df)
{
counted_df = df %>% 
           select(type, Sepal.Length) %>%
           mutate(Count = ifelse(Sepal.Length > 0, 1, 0),
           Percentage = Count/7)
return(counted_df)
}

I generate 100 duplicates of the test df2 with the following function:

duplicate_df = function(df, no_duplicates)
{
tmp_df_list = list()
for(i in c(1:no_duplicates))
    {
    print(paste0("Duplicate ", i, " generated."))
    tmp_df_list[[i]] = df
    }
return(tmp_df_list)
}

data_frames_list = duplicate_df(df = df2, no_duplicates = 100)

And use it with lapply: counted_data_frames = lapply(data_frames_list, add_a_count)

The list counted_data_frames can relatively easily be manipulated (You can use another apply function if you want a non-list output). This might not be the fastest way to do it, but it's straightforward.

EDIT

You can get your Counts columns via looping over the list of data frames. A new data frame counts_data_frame contains all counts with every column being counts of one original data frame:

counts_data_frame = data.frame(type = seq(from = 1, to = nrow(df2)))

for(i in c(1:length(counted_data_frames)))
{
  counts_data_frame = cbind(counts_data_frame, as.vector(counted_data_frames[[i]]["Count"]))
}

When looping over the rows of this new data frame, you can sum up your counts and get a vector of counts for plotting:

counts_summarised = vector(length = nrow(counts_data_frame))

for(i in c(1:nrow(counts_data_frame)))
{
  counts_summarised[i] = sum(counts_data_frame[i, 2:ncol(counts_data_frame)])
}

plot(counts_summarised, ylab = "Counts", xlab = "Type")

So close. Is there a way to combine the final output into just one 7x4 dataframe? Instead of 100 7x4's? The last two columns are the ones that make sense to keep and sum. The goal is to be able to see amongst all these 100's of dataframe, how many Sepal.length's are there over 0 for each type? — S31, Mar 06 '18 at 04:56
If I understand you correctly, the edit I made to my answer should give you a plot displaying the distribution you are looking for. Again, there is definitely a faster way (probably by replacing `for` loops by `apply` functions). But this is the first thing I could come up with. — Daniel Schütte, Mar 06 '18 at 05:29

hpesoj626 · Answer 3 · 2018-03-06T05:49:02.320

In this solution, I will show you how to:

import all CSV files, into separate data frames in a list, assuming that they all have the same column name for the variable you are interested in and that the files are in one folder (your working directory, preferrably);
count the number of 0 and nonzero measurements and their proportions;
convert the list into a dataframe

Specifically, I used lapply() for looping through the data.frames, converting the list to a data.frame using enframe(), unnesting the value column with unnest(), and spreading the pct by type using spread().

Let's first create a data to work with.

library(tidyverse)

# create a list
datlist <- list()
# this list will contain ten data frames with 
# a sample with up to 8 0's and 20 random uniforms as observations 
for (i in seq_len(10)){
  datlist[[i]] = data.frame(x = sample(c(sample(c(0,1,2,3,4), 8, replace = T), runif(20,0,10))))
}
# name each element of the list datlist
name_element <- LETTERS[1:10]
datlist <- set_names(datlist, name_element)
# save each file separately
mapply(write.csv, datlist, file=paste0(names(datlist), '.csv'), row.names = FALSE)

The following will import your data into R and store them as data.frames in a list.

# import all csv files in the folder into separate data frames in the temp list
temp <- list.files(pattern = "*.csv")
myfiles <- lapply(temp, read.csv)

The following will calculate the percentages by type if we assume that each file contains the same variables.

# Calculate the frequency and relative distributions 
lapply(myfiles, 
       function(varname) mutate(varname, type = if_else(x == 0, 0, 1)) %>%
         group_by(type) %>% summarise(n = n()) %>% 
         mutate(pct = n / sum(n))
       ) %>% 
  enframe() %>% # convert the list into a data.frame
  unnest(value) %>% # unnest the values
  spread(type, pct) # spread the values by type


# A tibble: 17 x 4
    name     n     `0`    `1`
   <int> <int>   <dbl>  <dbl>
 1     1     3  0.107  NA    
 2     1    25 NA       0.893
 3     2    28 NA       1.00 
 4     3     1  0.0357 NA    
 5     3    27 NA       0.964
 6     4     2  0.0714 NA    
 7     4    26 NA       0.929
 8     5    28 NA       1.00 
 9     6    28 NA       1.00 
10     7     2  0.0714 NA    
11     7    26 NA       0.929
12     8     3  0.107  NA    
13     8    25 NA       0.893
14     9     1  0.0357 NA    
15     9    27 NA       0.964
16    10     1  0.0357 NA    
17    10    27 NA       0.964

Count if >0 into frequency distribution in R

3 Answers3