-1

I have a data frame with 29 rows and 26 column with a lot of NA's. Data looks somewhat like shown below( working on R studio)

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
a1 b1 d1 d2 d3 d4 d5 na na e2
a1 b2 d2 d4 d1 e2 na e1 na na
a1 b3 d5 d3 d1 d4 na na e1 e2
a2 c1 e1 e2 na e3 na d2 d1 d4
a2 c2 d2 d4 d1 e2 na e1 na na
a2 c3 d5 d3 d1 d4 na na e1 e2

Here we have V1-V10 columns a1 and a2 are 2 distinct values in column 1 b1-b3 in column V2 are distinct values related to a1 in V1 column V3- V10 we have distinct values in each row

Result i want is as below-

NewV1 Newv2  NewV3
a1     3      7
a2     3      8

to summarize i want to get the (total count of values of V2 and the count of distinct values of V3-V10) based on V1

Abhinav Sharma
  • 45
  • 1
  • 2
  • 8

3 Answers3

1

looks like, no one can reproduce your results. Do you work case sensitive? if so, for dplyr approach you can try:

# import libraries and data
library(tidyverse)

df <- read.table(text = "V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
a1 b1 d1 d2 d3 d4 d5 NA NA E2
a1 b2 d2 d4 d1 E2 NA E1 NA NA
a1 b3 d5 d3 d1 d4 NA NA E1 E2
a2 c1 E1 E2 NA E3 NA D2 D1 D4
a2 c2 d2 d4 d1 E2 NA E1 NA NA
a2 c3 d5 d3 d1 d4 NA NA E1 E2", sep=" ", header = T, stringsAsFactors = F)

because you work case sensitive we add:

df <- data.frame(lapply(df, function(x) {
  if (is.character(x)) return(toupper(x))
  else return(x)
}))

now run:

 n_V2 <- df %>%
   gather(key, value, - V1, -V2) %>%
   group_by(V1) %>%
   distinct(V2) %>%
   summarise(Newv2=n())

 n_Vx <- df %>%
   gather(key, value, - V1, -V2) %>%
   filter(!is.na(value)) %>%
   group_by(V1) %>%
   distinct(value) %>%
   summarise(Newv3=n())  

 n_V2 %>% left_join(n_Vx)

that output is exactly yours:

 Joining, by = "V1"
 # A tibble: 2 x 3
   V1    Newv2 Newv3
   <fct> <int> <int>
 1 a1        3     7
 2 a2        3     8
Stephan
  • 2,056
  • 1
  • 9
  • 20
  • thank you for your help, but distinct values in V3-V10 based on a1 and a2 are 7(for a1) and 8(for a2) in total.. E1,E2,E3,d1,d2,d3,d4,d5... – Abhinav Sharma Mar 09 '18 at 11:59
  • ok, added a line from [here](https://stackoverflow.com/a/16516747/8598377) now it should work, try it – Stephan Mar 09 '18 at 12:15
  • Can this be done on data frame, importing a .csv file as i would have to work with same type of data but with more then 10k rows later on. How to import the complete data as here you have imported it as written text. – Abhinav Sharma Mar 12 '18 at 04:56
  • sure, you do `df <- read_delim(file)`. `read_delim` is a function from `readr`, a `tidyverse` package. it depends on your csv file, but with that function no problems should appear. for more information use [readr-documentation](http://readr.tidyverse.org/reference/read_delim.html) – Stephan Mar 12 '18 at 07:36
0

Following the description of your algorithm you can do with data.table:

library("data.table")

dt <- fread(
"V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
a1 b1 d1 d2 d3 d4 d5 NA NA E2
a1 b2 d2 d4 d1 E2 NA E1 NA NA
a1 b3 d5 d3 d1 d4 NA NA E1 E2
a2 c1 E1 E2 NA E3 NA D2 D1 D4
a2 c2 d2 d4 d1 E2 NA E1 NA NA
a2 c3 d5 d3 d1 d4 NA NA E1 E2")
dt[, .(nV2=length(V2), u3.10=uniqueN(na.omit(tolower(unlist(.SD))))), by=V1, .SDcols=3:10]
#    V1 nV2 u3.10
# 1: a1   3     7
# 2: a2   3     8

If you have a dataframe df you can coerce it into a data.table-object in place by:

setDT(df)
jogo
  • 12,469
  • 11
  • 37
  • 42
  • thank you for your help, but distinct values in V3-V10 based on a1 and a2 are 7(for a1) and 8(for a2) in total.. E1,E2,E3,d1,d2,d3,d4,d5... – Abhinav Sharma Mar 09 '18 at 12:02
  • Please put the description of the logic in your question, i.e. edit your question: https://stackoverflow.com/posts/49191499/edit Do you want to count `NA` as a distinct value. For a2 I count these: E1, E2, E3, NA, D2, D1, D4, d2, d4, d1, d5, d3 - that are 12 or (without NA) 11. Are D2 and d2 are distinct values? I edited my answer. – jogo Mar 09 '18 at 12:11
  • Now I can produce the desired result. – jogo Mar 09 '18 at 12:18
  • Can this be done on data frame, importing a .csv file as i would have to work with same type of data but with more then 10k rows later on. How to import the complete data as here you have imported it as written text. – Abhinav Sharma Mar 12 '18 at 04:57
  • `fread()` can read CSV-files: read the help pages of `fread()`. If you have a dataframe `df` you can coerce it to a data.table-object in place with `setDT(df)`. – jogo Mar 12 '18 at 07:24
  • Any possible way to get the value with maximum occurrence across columns 3:10 with respect to column V1 NOTE= it should not take NA into consideration – Abhinav Sharma Mar 16 '18 at 13:45
  • That is another question. – jogo Mar 16 '18 at 14:23
-1

You can do this using data.table:

library(data.table)
L3 <- LETTERS[1:3]
fac2 <- sample(L3, 10, replace = TRUE)
fac <- sample(L3, 10, replace = TRUE)
d <- data.frame(fac2 = fac2, fac = fac)
dd <- as.data.table(d)
dd[, num := length(unique(fac)), fac2]
dd[, mean(num), fac2]

Best regards

jovogt
  • 11
  • 2
  • thank you but can you elaborate it based on my data..bit confused to understand it. also will the approach work when i have same type of data but with 10k plus rows – Abhinav Sharma Mar 12 '18 at 04:58