I'm very new to R and struggling with using it for basic data analysis.

If I load a table, how can I find the Top 10 values for every column, along with each value's frequency & count of appearance? In addition, I'd like to also find out the frequency of blanks.

Using "Forbes2000" from the "HSAUR" package...

    data("Forbes2000", package = "HSAUR")
    head(Forbes2000)

The data contains 8 columns, some of which ("rank", "name", "sales", etc.) are unique per row. However, some columns ("country", "category") contain repeated values.

So, for each column, I'd like to find the top 10 unique values along with their counts and % frequency. If the column contains at least one blank/NULL, I'd also like an additional row showing the same info for the blanks. If every value in a column is unique, the results should still be limited to 10 rows.

So, something like... (numbers below made up)

   country              percentage   rank
   United States        85.35%       1
   United Kingdom       6.31%        2
   Canada               3.12%        3

   category             percentage   rank
   Banking              55.28%       1
   Conglomerates        20.75%       2
   Insurance            12.23%       3
   NULL                 3.32%        4
   Oil & gas operations 2.11%        5
   ...(etc)...

   sales                percentage   rank
   1234.56              0.05%        1
   987.65               0.05%        1
   986.32               0.05%        1
   822.12               0.05%        1
   ...(etc)...

I've looked around StackOverflow for a while and found a few ranking questions, but they were 2D in nature (How to return 5 topmost values from vector in R?) or for a single column (How to find the top N values by group or within category (groupwise) in an R data.frame). I'm looking for a solution that is 3D in nature, as appending

    names(Forbes2000)

doesn't seem to work to loop through all the columns.

Markian Zadony
  • Write a little function `foo` that does what you want for one column, then `lapply(Forbes2000, foo)` will apply it to every column and return the results in a nice list. – Gregor Thomas Feb 28 '17 at 19:14
  • I was going to write the same comment that Gregor just did. I think it is a good exercise, as these are necessary and basic R skills - even necessary and basic skills in any language used for data analysis. – Mike Wise Feb 28 '17 at 19:16
  • If `lapply` seems too weird, write a for loop over the columns. And have a look at a good online R text like Hadley's Advanced R. – Mike Wise Feb 28 '17 at 19:17
  • Thanks for the replies. I wish this hadn't been marked "too broad". I'm trying to get over the "activation energy" hump in using R to see how to do this. Unfortunately, I didn't see much on how to do multiple things (rank & percentage) to multiple fields in R, which is why I included the SO links above... – Markian Zadony Mar 01 '17 at 19:37
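
The for loop suggested in the comments might look roughly like the sketch below. It uses a small made-up data frame in place of Forbes2000 so that it runs without any packages; the names `df`, `top_values`, and `results` are hypothetical and only for illustration.

```r
# Toy data frame standing in for Forbes2000 (hypothetical values)
df <- data.frame(
  country  = c("US", "US", "UK", NA, "US", "UK"),
  category = c("Banking", "Oil", "Banking", "Banking", NA, NA),
  stringsAsFactors = FALSE
)

# Summarise a single column: counts and percentages, most frequent first
top_values <- function(x, n = 10) {
  # useNA = "ifany" adds an entry for missing values when there are any
  counts <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
  counts <- head(counts, n)
  data.frame(
    value = names(counts),
    count = as.vector(counts),
    perc  = round(100 * as.vector(counts) / length(x), 2)
  )
}

# Explicit loop over the column names, collecting one summary per column
results <- list()
for (col in names(df)) {
  results[[col]] <- top_values(df[[col]])
}
results$country
```

Looping over `names(df)` and indexing with `df[[col]]` hands the helper one column at a time; `lapply(df, top_values)` should produce the same list with less bookkeeping.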

1 Answer

Something like this?

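    library("HSAUR")
    data("Forbes2000", package = "HSAUR")

    # Summarise one column: top 10 values with count, percentage and rank
    f <- function(x) {
      # useNA = "ifany" adds a row for missing values when the column has any
      counts <- head(sort(table(x, useNA = "ifany"), decreasing = TRUE), 10)
      cnt <- as.vector(counts)
      data.frame(value = names(counts),
                 count = cnt,
                 perc  = paste0(round(cnt * 100 / length(x), 2), "%"),
                 # dense rank: tied counts share a rank
                 rank  = seq_along(cnt) - cumsum(duplicated(cnt)))
    }
    lapply(Forbes2000, f)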
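
If one table is easier to work with than a list, the per-column summaries can be stacked into a single data frame. The following is only a sketch under a few assumptions: it uses made-up data rather than Forbes2000, the helper `summarise_column` is hypothetical, and it treats empty strings as blanks alongside `NA`.

```r
# Toy data (hypothetical); "" stands for a blank entry
df <- data.frame(
  country = c("US", "US", "UK", "", "US"),
  sales   = c(10.5, 7.2, 7.2, NA, 3.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: top-n values of one column, tagged with the column name
summarise_column <- function(x, column, n = 10) {
  x <- as.character(x)                  # so every column yields the same types
  x[!is.na(x) & x == ""] <- NA          # count empty strings as blanks
  counts <- head(sort(table(x, useNA = "ifany"), decreasing = TRUE), n)
  data.frame(
    column = column,
    value  = names(counts),
    count  = as.vector(counts),
    perc   = round(100 * as.vector(counts) / length(x), 2),
    stringsAsFactors = FALSE
  )
}

# Map() passes each column together with its name; rbind stacks the pieces
combined <- do.call(rbind, Map(summarise_column, df, names(df)))
rownames(combined) <- NULL
combined
```

Converting every column to character keeps the `value` column type consistent across summaries, which is what allows `rbind` to stack them; the trade-off is that numeric values are shown as text.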
Federico Manigrasso