1

I have dataset more than 1000000 cells and I only need the data with more than 20 days in 1 year using R based on the site name.

This is the original dataset:

enter image description here

I can use 'count if' function in excel to calculate how many days are in a month, but I found it extremely slow and causing problems. Here is the formula that I used on Excel.

My code on Excel:

enter image description here

I'm sorry for my basic question, but I hope there is someone who can help me with this matters.

ekoam
  • 8,744
  • 1
  • 9
  • 22
  • 1
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. We don't need all your real data, just something to test with. Please don't post images of data or code. We can't copy/paste those to edit them or import them. – MrFlick Oct 20 '20 at 03:55
  • Which version of excel do you use? If you have `Office365` then it can be speedup significantly. – Harun24hr Oct 20 '20 at 04:27
  • Is this just `df %>% count(site, Year, Month)`? – Rui Barradas Oct 20 '20 at 06:36
  • thank you! its my first time post something here, will definitely include sample next time! – Gerry Andhika Oct 21 '20 at 06:29

3 Answers3

2

If I understand correctly you want the number of days per site. If your data looks like this:

site = c("a", "b", "c", "a", "b", "c", "a", "b", "c")
year = c(1991, 1992, 1993, 1991, 1992, 1993, 1991, 1992, 1993)
month = c(1, 1, 4, 4, 1, 1, 4, 4, 1)
my_data = data.frame(site, year, month)

You can use the package dplyr (install via install.packages(dplyr)):

library(dplyr)
my_data %>% group_by(site) %>% count(year, month)

Output:

# A tibble: 6 x 4
# Groups:   site [3]
  site   year month     n
  <chr> <dbl> <dbl> <int>
1 a      1991     1     1
2 a      1991     4     2
3 b      1992     1     2
4 b      1992     4     1
5 c      1993     1     2
6 c      1993     4     1

You can post a snippet of your code using dput:

dput(my_data)
structure(list(site = c("a", "b", "c", "a", "b", "c", "a", "b", 
"c"), year = c(1991, 1992, 1993, 1991, 1992, 1993, 1991, 1992, 
1993), month = c(1, 1, 4, 4, 1, 1, 4, 4, 1)), class = "data.frame", row.names = c(NA, 
-9L))

Then other people can take that the above code and reproduce your data. It doesn't have to be ALL your data, just the first ~10 rows.

Dharman
  • 30,962
  • 25
  • 85
  • 135
Amar
  • 1,340
  • 1
  • 8
  • 20
1

Agree with comment - you need to create a reproducible example. Anything in Excel can be done better in R (well, almost anything) Many methods possible - try grouping as below

df = data.frame(matrix(rnorm(40), nrow=10))
X0 <-sample(c(rep("A",3),rep("B",3),rep("C",4)), 10)
df$X0 <- x0
df        # this is a simulated dataset
           X1          X2         X3          X4 X0
1  -0.1203975  0.40176671 -0.1441914 -1.27356811  B
2  -1.2945289  0.34813784  0.3233932  0.80881055  B
3   1.0095911  0.69446583 -0.4377096 -1.41026705  A
4  -0.2547187  1.04914805 -0.8120526 -0.58734365  C
5  -0.1732585 -0.16422459 -0.8195833  0.44932410  A
6   0.5599036 -0.79013084 -0.9540998 -0.03861297  C
7  -0.5338999  0.03481498  0.3663478  0.44230402  B
8   0.9837636  0.98619889 -0.4901891  0.45207061  C
9   0.7892718 -0.23403087 -0.1102857  0.79010878  A
10  0.9392468  0.54867442 -0.4180297 -0.40482425  C


# Next, we count/index by X0 ~ you will need to do this by "site" /your choice
library(dplyr)
df1 <- df %>%
  group_by(X0) %>%
  add_count
df1

# A tibble: 10 x 6
# Groups:   X0 [3]
       X1      X2     X3      X4 X0        n
    <dbl>   <dbl>  <dbl>   <dbl> <chr> <int>
 1 -0.120  0.402  -0.144 -1.27   B         3
 2 -1.29   0.348   0.323  0.809  B         3
 3  1.01   0.694  -0.438 -1.41   A         3
 4 -0.255  1.05   -0.812 -0.587  C         4
 5 -0.173 -0.164  -0.820  0.449  A         3
 6  0.560 -0.790  -0.954 -0.0386 C         4
 7 -0.534  0.0348  0.366  0.442  B         3
 8  0.984  0.986  -0.490  0.452  C         4
 9  0.789 -0.234  -0.110  0.790  A         3
10  0.939  0.549  -0.418 -0.405  C         4
# After the above, you can sort and cut OR filter your count variable 
by 20 or any such value! Here, rows with n >3 are subset. Done.     

df2 <- df1[df1$n>3,]
df2
# A tibble: 4 x 6
# Groups:   X0 [1]
      X1     X2     X3      X4 X0        n
   <dbl>  <dbl>  <dbl>   <dbl> <chr> <int>
1 -0.255  1.05  -0.812 -0.587  C         4
2  0.560 -0.790 -0.954 -0.0386 C         4
3  0.984  0.986 -0.490  0.452  C         4
4  0.939  0.549 -0.418 -0.405  C         4
aiatay7n
  • 182
  • 1
  • 7
-2

I will suggest you create an new column to CONCATENATE B, J and K columns, and then paste as value the column. After that, you can simply use countif instead of countifs.

Victor19
  • 97
  • 8