
After importing a table from Wikipedia, I have a list of values of the following form:

    > tbl[2:6]
    $`Internet
    Explorer`
     [1] "30.71%" "30.78%" "31.23%" "32.08%" "32.70%" "32.85%" "32.04%" "32.31%" "32.12%" "34.07%" "34.81%"
    [12] "35.75%" "37.45%" "38.65%" "40.63%" "40.18%" "41.66%" "41.89%" "42.45%" "43.58%" "43.87%" "44.52%"

    $Chrome
     [1] "36.52%" "36.42%" "35.72%" "34.77%" "34.21%" "33.59%" "33.81%" "32.76%" "32.43%" "31.23%" "30.87%"
    [12] "29.84%" "28.40%" "27.27%" "25.69%" "25.00%" "23.61%" "23.16%" "22.14%" "20.65%" "19.36%" "18.29%"

I am trying to get rid of the percentage signs, in order to convert the data to numeric form.

Is there a quicker way to clean this data than going for a vectorization? My current code follows:

    data <- lapply(tbl[2:6], FUN = function(x) as.numeric(gsub("%", "", x)))

The data eventually become a data frame, but I could not get gsub to work properly across all elements of a data frame. Is there a way to gsub() each element of a data frame?

The code for the project is online, with results. Thanks in advance!

Fr.
  • That is more likely just a list than a data frame. And ... lapply will also work with data frames since they are actually lists with special attributes. – IRTFM Feb 14 '13 at 10:52
  • It is a list. But `gsub` does not work as I need it to on it (`lapply` works fine). – Fr. Feb 14 '13 at 10:55
  • Because data.frames are special lists and you have a tested method for lists, this would have almost surely worked: `dfrm <- as.data.frame(lapply(tbl[2:6], FUN = function(x) as.numeric(gsub("%", "", x))) )` – IRTFM Feb 14 '13 at 11:42
  • Indeed, that would work, but I am trying to go without vectorization, staying at the level of `as.` functions to get the data in shape for cleaning. Your argument is otherwise entirely correct. – Fr. Feb 14 '13 at 20:08
  • @BondedDust I used lapply with gsub on my data frame and all columns are now converted to factor. Trying to convert back to numeric and saw this post: http://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information Any other ideas? – vagabond Oct 24 '14 at 20:32
  • Eww. I think they started out as factors. I thought gsub would handle the factor to character automatically, but maybe the assignment back to a dataframe object reconverted to factor. It's always better to post dput(object) rather than the ambiguous display you see at the console. You can now do: `lapply(obj, function(x) as.numeric(as.character(x)) )` – IRTFM Oct 25 '14 at 00:04
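
The factor pitfall raised in the last two comments can be reproduced on a toy vector. This is only a minimal sketch: the factor `f` is made up for illustration and is not taken from the original data.

```r
# Hypothetical factor column, as produced when strings are read in as factors
f <- factor(c("30.71%", "36.52%", "30.71%"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(f)

# Go through character first, strip the sign, then convert to numeric
values <- as.numeric(gsub("%", "", as.character(f), fixed = TRUE))

print(codes)
print(values)
```

The `fixed = TRUE` argument treats `"%"` as a literal string rather than a regular expression; it is optional here, since `%` has no special meaning in a regex.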

3 Answers


Well, I think you could do it the following way, but I don't know if it is better or cleaner than yours:

    df <- data.frame(tbl)
    df[,-1] <- as.numeric(gsub("%", "", as.matrix(df[,-1])))

Which gives:

    R> head(df)
                Date Internet.Explorer Chrome Firefox Safari Opera Mobile
    1   January 2013             30.71  36.52   21.42   8.29  1.19  14.13
    2  December 2012             30.78  36.42   21.89   7.92  1.26  14.55
    3  November 2012             31.23  35.72   22.37   7.83  1.39  13.08
    4   October 2012             32.08  34.77   22.32   7.81  1.63  12.30
    5 September 2012             32.70  34.21   22.40   7.70  1.61  12.03
    6    August 2012             32.85  33.59   22.85   7.39  1.63  11.78
    R> sapply(df, class)
                 Date Internet.Explorer            Chrome           Firefox
             "factor"         "numeric"         "numeric"         "numeric"
               Safari             Opera            Mobile
            "numeric"         "numeric"         "numeric"
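
To try the one-liner without importing the Wikipedia table, here is a small offline sketch. The two-row data frame `df` and its column names are made up for illustration; only the assignment line mirrors the code above.

```r
# Hypothetical two-row stand-in for the imported table
df <- data.frame(
  Date   = c("January 2013", "December 2012"),
  IE     = c("30.71%", "30.78%"),
  Chrome = c("36.52%", "36.42%"),
  stringsAsFactors = FALSE
)

# gsub() drops the matrix dimensions and returns a plain character vector
# in column-major order; assigning it back refills the columns in that order
df[, -1] <- as.numeric(gsub("%", "", as.matrix(df[, -1])))

str(df)
```

The step works because the vector has exactly one value per cell of the selected columns, so the data frame assignment can spread it back column by column.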
juba
  • This works best for me, it is both shorter and easier to read. I have updated the code to acknowledge it. – Fr. Feb 14 '13 at 20:09
  • Ah well, thanks for the credits. I'll put you as co-author of my package in return :) – juba Feb 15 '13 at 07:59
  • [off-topic] Thanks! I'm planning more functions like the one I submitted. Most of them are directly inspired by Stata commands that I find most useful to analyse surveys. [on-topic] It happens quite often to have a data frame where all columns but one are formatted the same way. I'm also thinking of coding a little routine that would work a bit like `melt` (with an `id.vars` argument) for these kinds of operations. – Fr. Feb 16 '13 at 01:57
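
The `id.vars`-style routine mentioned in the last comment could look roughly like the sketch below. The function name `strip_percent`, its signature, and the toy data are all hypothetical; nothing here comes from the commenter's actual package.

```r
# Hypothetical helper: strip "%" from every column except those in id.vars
strip_percent <- function(df, id.vars = character(0)) {
  # Columns to clean: everything not named in id.vars
  targets <- setdiff(names(df), id.vars)
  # as.character() guards against factor columns
  df[targets] <- lapply(df[targets], function(x) {
    as.numeric(gsub("%", "", as.character(x), fixed = TRUE))
  })
  df
}

# Toy data standing in for the imported table
toy <- data.frame(
  Date    = c("January 2013", "December 2012"),
  Chrome  = c("36.52%", "36.42%"),
  Firefox = c("21.42%", "21.89%"),
  stringsAsFactors = FALSE
)

clean <- strip_percent(toy, id.vars = "Date")
print(sapply(clean, class))
```

Assigning into `df[targets]` keeps the result a data frame and leaves the identifier columns untouched, which is the behaviour the `melt` analogy suggests.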

Like juba I'm uncertain if this way is "better or cleaner" but...to act on all elements of a data frame, you can use apply:

    # readHTMLTable() comes from the XML package
    library(XML)

    # start with data frame, not list
    url <- "http://en.wikipedia.org/wiki/Usage_share_of_web_browsers"
    # Get the eleventh table.
    tbl <- readHTMLTable(url, which = 11, stringsAsFactors = FALSE)

    # use apply on the non-date columns
    tbl[, 2:7] <- apply(tbl[, 2:7], 2, function(x) as.numeric(gsub("%", "", x)))
neilfws

I would do this by using a for-loop (I know people don't like loops that much but at least it doesn't touch your data structure):

    # Columns 2 to 6 hold the percentages; column 1 is the date
    for (i in 2:6) {
        tbl[[i]] <- as.numeric(gsub("%", "", tbl[[i]]))
    }
micsky