0

I have a big data frame of many variables and their options, so I want the count of all variables and their options. for example the data frame below.

also I have same another data frame and if I want to merge these two data frame, to check if the names of column are same , if not the get the names of different column names.

Excluding c(uniqueid,name) column the objective is to find if we have any misspelled words with the help of count, or the words have any accent.

df11 <- data.frame(uniqueid=c(9143,2357,4339,8927,9149,4285,2683,8217,3702,7857,3255,4262,8501,7111,2681,6970),
                    name=c("xly,mnn","xab,Lan","mhy,mun","vgtu,mmc","ftu,sdh","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","sghu,njui","sgyu,hytb","vdti,kula","mftyu,huta","mhuk,ghul","cday,bhsue","ajtu,nudj"),
                    city=c("A","B","C","C","D","F","S","C","E","S","A","B","W","S","C","A"),
                    age=c(22,45,67,34,43,22,34,43,34,52,37,44,41,40,39,30),
                    country=c("usa","USA","AUSTRALI","AUSTRALIA","uk","UK","SPAIN","SPAIN","CHINA","CHINA","BRAZIL","BRASIL","CHILE","USA","CANADA","UK"),
                    language=c("ENGLISH(US)","ENGLISH(US)","ENGLISH","ENGLISH","ENGLISH(UK)","ENGLISH(UK)","SPANISH","SPANISH","CHINESE","CHINESE","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH(US)"),
                    gender=c("MALE","FEMALE","male","m","f","MALE","FEMALE","f","FEMALE","MALE","MALE","MALE","FEMALE","FEMALE","MALE","MALE"))

the output should be like a summary of count of group of variables and their options. its a kind of Pivot for Eg: for city so it should select all available columns in data frame and the give kind of summary of counts for all options available in columns

enter image description here

sanuali0123
  • 133
  • 5
  • 2
    Your input data is not reproducible at the moment. Also It is not clear what your expected output is. – s_baldur Sep 22 '20 at 08:33
  • Have you tryied things with `unique()`, `lenght()`, `base::intersect()`, `attributes()`, `str()`, `names()`, etc.? – Paul Sep 22 '20 at 09:18
  • i just updated question, kinda new to R – sanuali0123 Sep 22 '20 at 09:28
  • You can look at ["tidyverse" options](https://r4ds.had.co.nz/) or, if you just want quick results, you can try things such as `apply(df11[, 2:ncol(df11)], 2, table)` – Paul Sep 22 '20 at 09:37
  • For better display: `apply(df11[, 2:ncol(df11)], 2, function(x) as.data.frame(table(x)))` – Paul Sep 22 '20 at 09:47
  • but i want column wise summary – sanuali0123 Sep 22 '20 at 09:47
  • I am not sure to understand, these lines give you the same content you show with the picture of your expected output. See the console output of `apply(df11[, 2:ncol(df11)], 2, function(x) as.data.frame(table(x)))["city"]` – Paul Sep 22 '20 at 09:52
  • If you want to make it "easier" to read and use, you might want to look at `dplyr` and `tidyr` packages and functions `group_by()`, `summarise()`, `pivot_longer()` – Paul Sep 22 '20 at 09:56
  • actually that is just a example for city column, so i want export the summary in excel like , in first column excel should be all columns name , then second column will be options of headers in data frame, and third column will be like count of options in data frame. – sanuali0123 Sep 22 '20 at 10:00

2 Answers2

1

You could count calculate the length of unique values in an aggregate.

res <- aggregate(. ~ city, df11, function(x) length(unique(x)))
res
#   city uniqueid name age country language gender
# 1    A        3    3   3       3        2      1
# 2    B        2    2   2       2        2      2
# 3    C        4    4   4       4        2      4
# 4    D        1    1   1       1        1      1
# 5    E        1    1   1       1        1      1
# 6    F        1    1   1       1        1      1
# 7    S        3    3   3       3        3      2
# 8    W        1    1   1       1        1      1
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • actually that is just a example for city column, so i want export the summary in excel like , in first column excel should be all columns name , then second column will be options of headers in data frame, and third column will be like count of options in data. – sanuali0123 Sep 22 '20 at 10:40
  • @sanuali0123 Yeah, that's similar to what you commented above. Could you elaborate on your expected output? Whats wrong when exporting by doing `xlsx::write.xlsx(res, "res.xlsx")`? – jay.sf Sep 22 '20 at 10:47
  • @sanuali0123 Also, if you look at `?reshape()` you will see how to pivot this table from long to wide format (with "name", "age", etc. as values of a column "variables"). Please show what you have tried so far... – Paul Sep 22 '20 at 10:51
1

I am quite confused with what you call "option" but here is something to start with using only base R functions.

Note: it only refers to the 1st part of the question "I want the count of all variables and their options".

res <- do.call(rbind, lapply(df11[, 3:ncol(df11)], function(option) as.data.frame(table(option)))) # apply table() to the selected columns and gather the output in a dataframe
res$variable <- gsub("[.](.*)","", rownames(res)) # recover the name of the variable from the row names with a regular expression
rownames(res) <- NULL # just to clean
res <- res[, c(3,1,2)] # ordering columns
res <- res[order(-res$Freq), ] # sorting by decreasing Freq

The output:

> res
   variable      option Freq
34 language     ENGLISH    7
42   gender        MALE    7
39   gender      FEMALE    5
3      city           C    4
1      city           A    3
7      city           S    3
11      age          34    3
36 language ENGLISH(US)    3
2      city           B    2
9       age          22    2
16      age          43    2
27  country       CHINA    2
28  country       SPAIN    2
30  country          UK    2
32  country         USA    2
33 language     CHINESE    2
35 language ENGLISH(UK)    2
37 language     SPANISH    2
38   gender           f    2
4      city           D    1
5      city           E    1
6      city           F    1
8      city           W    1
10      age          30    1
12      age          37    1
13      age          39    1
14      age          40    1
15      age          41    1
17      age          44    1
18      age          45    1
19      age          52    1
20      age          67    1
21  country    AUSTRALI    1
22  country   AUSTRALIA    1
23  country      BRASIL    1
24  country      BRAZIL    1
25  country      CANADA    1
26  country       CHILE    1
29  country          uk    1
31  country         usa    1
40   gender           m    1
41   gender        male    1
Paul
  • 2,850
  • 1
  • 12
  • 37
  • this works perfectly as I required, but one more thing if i want to sort freq for all variables in desc then....??? – sanuali0123 Sep 24 '20 at 10:20
  • one more thing if I want unselect a list of column I mean to ignore those columns . to unselect in df11[, 3:ncol(df11). may be there are columns with serial numbers or random numbers between 3:ncol(df11) so I want to uncheck those by giving a list of columns – sanuali0123 Sep 24 '20 at 10:31
  • @sanuali0123 added the code to sort the data. If you want to select only certain columns, you can either pass a vector of indices (e.g. `c(1, 2, 5)`) or a vector with the column names you want to keep (e.g. `c("variable1", "variable5")`. They are many ways to do it in fact, you may want to have a look [here](https://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name/4605268#4605268) – Paul Sep 24 '20 at 11:20
  • @sanuali0123 Here some more tips about [the R index system](https://rspatial.org/intr/4-indexing.html). Please google it around, this is fundamental to work with R objects. Also check: https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf (parts 3.5.4 and 3.5.5). – Paul Sep 24 '20 at 11:26
  • for sorting by groupby(variable) giving error, what i am doing wrong res <- group_by(res[res$variable]) %>% res[order(-res$Freq), ] – sanuali0123 Sep 24 '20 at 11:39
  • @sanuali0123 If you use `dplyr` package just do: `res <- res %>% group_by(variable) %>% arrange(-Freq) ` . With `dplyr` syntax, you don't need to use `[ ]` or `$` when referring to columns of `.data` (here `res`) – Paul Sep 24 '20 at 11:57