Names of R's available packages

Question

I'm eager to know,

how many package names on CRAN have two, three, N characters?
which combinations have not yet been used ("unpoppler")
how many package names use full-caps, or camelCase?
how many package names end in 2?

I think it might reveal some interesting facts.

Edit: bonus points for animated graphics showing the time-evolution of CRAN packages.

It's an interesting question, but I'm not sure if it's really SO-style. Should just be a matter of scraping the names off http://cran.r-project.org/web/packages/available_packages_by_name.html and running a few regexes on them, though. — Owen, Sep 11 '11 at 23:28
@Owen I agree on both points, but curiosity's sake won me over. — baptiste, Sep 11 '11 at 23:36
Scraping is overkill: just look at `myList[,"Package"]` where `myList <- available.packages()`. This list is subject to change every day. — Iterator, Sep 12 '11 at 00:03
As for the time evolution, here is a database with API: https://github.com/metacran/crandb (Disclaimer: I am the author of it.) It has some incorrect data, in particular dates of archivals are often wrong. Some of these I can and will fix, but for some, there is just no information available AFAIK. — Gabor Csardi, Sep 09 '14 at 02:00

score 14 · Accepted Answer · answered Sep 12 '11 at 09:45

A better way to get the package names than scraping a web page is to call available.packages() and process the result. available.packages() returns a matrix containing details of all available packages (the result is filtered by default; see the Details section of ?available.packages for more).

pkgs <- available.packages(filters = "duplicates")
nameCount <- unname(nchar(pkgs[, "Package"]))
table(nameCount)

> table(nameCount)
nameCount
  2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21 
 32 311 374 360 434 445 368 277 199 132  99  56  56  43  22  19  18   2  12   8 
 22  24  25  31 
  5   2   1   1

Using nameCount we can select packages with names containing any number of characters without needing to resort to regexp etc:

> unname(pkgs[which(nameCount == 2), "Package"])
 [1] "BB" "bs" "ca" "cg" "dr" "ez" "FD" "ff" "HH" "HI" "iv" "JM" "ks" "M3" "mi"
[16] "np" "oc" "oz" "PK" "PP" "qp" "QT" "RC" "rv" "Rz" "sm" "sn" "sp" "st" "SV"
[31] "tm" "wq"
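The question also asks which combinations have not been used yet, which none of the code above covers. A minimal sketch for the two-letter case, assuming a small made-up vector `pkg_names` in place of `pkgs[, "Package"]` so that it runs offline. It only enumerates letter-letter pairs (so names such as "M3" fall outside it) and compares case-insensitively; both are simplifying choices, not CRAN rules.

```r
# Hypothetical sample standing in for pkgs[, "Package"]; swap in the real vector.
pkg_names <- c("BB", "bs", "ca", "ff", "sp", "tm", "ggplot2", "MASS")

# Keep only the two-character names, lower-cased for comparison.
two_char <- tolower(pkg_names[nchar(pkg_names) == 2])

# All 26 * 26 two-letter combinations.
all_pairs <- as.vector(outer(letters, letters, paste0))

# Combinations that no name in the sample uses.
unused <- setdiff(all_pairs, two_char)

length(all_pairs)   # 676
length(unused)      # 670 for this sample
head(unused)
```

With the real matrix, replacing `pkg_names` by `unname(pkgs[, "Package"])` should be the only change needed.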

score 10 · Answer 2 · answered Sep 12 '11 at 09:45


Here's one attempt based on the various suggestions.

 library(ggplot2)

 packages <- available.packages()[,'Package']

 ggplot(data.frame(n = nchar(packages))) +
   geom_histogram(aes(n), binwidth=1)

[figure: histogram]

 all <- length(packages)
 ## 3168
 up <- sum(toupper(packages) == packages)
 ## 262
 low <- sum(tolower(packages) == packages)
 ## 1697
 pie(c(up, low, all-up-low), labels=c("UPPERCASE","lowercase","cAmElCaSe"))

[figure: pie chart]

 let <- sapply(sapply(letters, grep, tolower(packages)), length)
 barplot(let)

[figure: barplot]

 length(packages[grep("2$", packages, perl=TRUE)])
 # 29

answered Sep 12 '11 at 09:45

baptiste


Also interesting: how many packages contain dots? `packages[grepl("\\.", packages)]` – Richie Cotton Sep 12 '11 at 11:21

In your pie chart, "cAmElCaSe" also includes those with only the first letter capitalized, which I don't think of as CamelCase. – Brian Diggs Sep 12 '11 at 13:53
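Following up on the comment above, here is a rough sketch that separates names with only the first letter capitalized from genuinely mixed-case names. The helper `classify_case` and the sample vector are hypothetical, and the four category labels are just one possible split; digits and dots are ignored when classifying.

```r
# Hypothetical sample; replace with available.packages()[, "Package"].
packages <- c("MASS", "ggplot2", "Matrix", "data.table",
              "RColorBrewer", "lme4", "XML", "gWidgets")

classify_case <- function(x) {
  # Drop digits and dots so only the letters decide the category.
  letters_only <- gsub("[^[:alpha:]]", "", x)
  ifelse(letters_only == toupper(letters_only), "UPPERCASE",
  ifelse(letters_only == tolower(letters_only), "lowercase",
  ifelse(grepl("^[[:upper:]][[:lower:]]*$", letters_only),
         "Capitalized", "mixedCase")))
}

case_class <- classify_case(packages)
table(case_class)
```

In this sample, "Matrix" lands in "Capitalized" while "RColorBrewer" and "gWidgets" land in "mixedCase"; `pie(table(case_class))` would then give a four-way version of the chart.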

score 5 · Answer 3 · answered Sep 11 '11 at 23:43


Here is a short piece of code to answer some questions. I will keep adding to my answer when I find time.

library(XML); library(ggplot2);

url = 'http://cran.r-project.org/web/packages/available_packages_by_name.html'
packages = readHTMLTable(url, stringsAsFactors = FALSE)[[1]][-1,]

# histogram of number of characters in package name
qplot(nchar(V1), data = packages)

answered Sep 11 '11 at 23:43

Ramnath


Nice, `RthroughExcelWorkbooksInstaller2` anyone? :) – baptiste Sep 11 '11 at 23:52
+1, though this might slightly overstate the number with two characters as it is picking up some `NA` values (e.g. `packages[128,]`) when the first letter changes, perhaps from html section names. – Henry Sep 12 '11 at 06:49
Yes, but I think the `available.packages` solution is more elegant and robust. – Ramnath Sep 12 '11 at 13:47

score 1 · Answer 4 · answered Sep 12 '11 at 09:05


Make a vector of all the packages using

myList <- available.packages()[,'Package']

Then you can analyze it however you want. For example, to list the packages with two-character names:

myList[grep('^..$', myList)]

answered Sep 12 '11 at 09:05

adamleerich


`nchar` is faster and more readable for counting the number of characters in a string. `myList[nchar(myList) == 2]` – Richie Cotton Sep 12 '11 at 11:11
Also, use `grepl` rather than `grep` for indexing. If you have an empty match, `grep` doesn't do what you think it does. – Richie Cotton Sep 12 '11 at 11:14
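To illustrate the empty-match pitfall mentioned in the comment above, here is a small sketch on a made-up vector `pkg_names`. The problem shows up with negative indexing: when `grep` finds nothing it returns `integer(0)`, and indexing with `-integer(0)` selects nothing instead of everything.

```r
pkg_names <- c("BB", "ff", "ggplot2", "MASS")   # hypothetical sample

# This pattern (one-character names) matches nothing in the sample.
no_hits <- grep("^.$", pkg_names)

# Intent: drop one-character names. With an empty match, negative
# indexing returns an empty vector, so every name is lost.
dropped_with_grep <- pkg_names[-no_hits]
length(dropped_with_grep)    # 0

# grepl() returns a logical vector of full length, so negating it is safe.
dropped_with_grepl <- pkg_names[!grepl("^.$", pkg_names)]
length(dropped_with_grepl)   # 4
```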
