0

I have 25 data sets each is structured the same. Each contains many rows and 7 columns. Column 6 contains data that should be numerical but is not numerical. They are not numerical because the numbers contain commas i.e. 100000 is 100,000.

I can manually resolve this in each data set by removing the comma and then specifying that the data is numerical using the following code

df$column_6 <- gsub("[,]" , "", df$column_6)
df$column_6 <- as.numerical(df$column_6)

However as there are 25 data sets I would like to loop through them doing this however I am unable to do this.

Additionally because column 6 has a different name in each data set I would prefer to specify column 6 without using its name like below

df[6] <- gsub("[,]" , "", df[6])

however this doesn't seem to work.

My code is as follows

list_of_dfs = c(df1, df2, ..... , df25)

for (i in list_of_dfs) {
  i[6] <- gsub("[,]" , "", i[6])
  i[6] <- as.numerical(i[6])
}

Does anyone have any advice on how to do this

user9903833
  • 137
  • 6

4 Answers4

2

Your code is close, but has a few problems:

  • the result never gets assigned back the the list.
  • as.numerical is a typo, it needs to be as.numeric
  • i[6] doesn't work because you need to specify that it's the 6th column you want: i[, 6]. See here for details on [ vs [[.
  • c(df1, df2) doesn't actually create a list of data frames

Try this instead:

## this is bad, it will make a single list of columns, not of data frames
# list_of_dfs = c(df1, df2, ..... , df25)

# use this instead
list_of_dfs = list(df1, df2, ..... , df25)
# or this
list_of_dfs = mget(ls(pattern = "df"))

for (i in seq_along(list_of_dfs)) {
  list_of_dfs[[i]][, 6] <- as.numeric(gsub("[,]" , "", list_of_dfs[[i]][, 6]))
}

We can do a bit better, gsub uses pattern-matching regular expressions by default, using the fixed = TRUE argument instead will be quite a bit faster:

for (i in seq_along(list_of_dfs)) {
  list_of_dfs[[i]][, 6] <- as.numeric(gsub(",", "", list_of_dfs[[i]][, 6], fixed = TRUE))
}

And we could use lapply instead of a for loop for slightly shorter code:

list_of_dfs[[i]] <- lapply(list_of_dfs, function(x) {
    x[, 6] = as.numeric(gsub("," , "", x[, 6], fixed = TRUE))
    return(x)
})
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Hi Gregor Thank you for a really comprehensive answer. I was just hoping I could ask you a follow up question. I have noticed that this method doesn't directly effect the original data set but applies the desired result to the list object. such that View(list_of_dfs[[1]][[6]]) shows the data as I want but View(df1[,6]) shows still shows the original data set. Is this normal and if so is there a way to destructively edit df1 so that View(list_of_dfs[[1]][[6]]) and View(df1[,6]) show the same thing. – user9903833 Sep 20 '18 at 16:41
  • No. When you make a list of data frames, they are copies and are not linked to the original. You should remove the originals (or never create them in the first place, create the list directly from the source files) and just work with the list. – Gregor Thomas Sep 20 '18 at 17:03
1

Try this out. You put all dataframes in a list, then you make the column numeric. Instead of gsub I use readr::parse_number. I'll also include a practice set for illustration.

library(tidyverse)

df1 <- data_frame(id = rep(1,3), num = c("10,000", "11,000", "12,000"))
df2 <- data_frame(id = rep(2,3), num = c("13,000", "14,000", "15,000"))
df3 <- data_frame(id = rep(3,3), num = c("16,000", "17,000", "18,000"))

list(df1, df2, df3) %>% map(~mutate(.x, num = parse_number(num)))
#> [[1]]
#> # A tibble: 3 x 2
#>      id   num
#>   <dbl> <dbl>
#> 1     1 10000
#> 2     1 11000
#> 3     1 12000
#> 
#> [[2]]
#> # A tibble: 3 x 2
#>      id   num
#>   <dbl> <dbl>
#> 1     2 13000
#> 2     2 14000
#> 3     2 15000
#> 
#> [[3]]
#> # A tibble: 3 x 2
#>      id   num
#>   <dbl> <dbl>
#> 1     3 16000
#> 2     3 17000
#> 3     3 18000

Created on 2018-09-20 by the reprex package (v0.2.0).

AndS.
  • 7,748
  • 2
  • 12
  • 17
0

Part of the answer has been sourced from here: Looping through list of data frames in R

In your case, you can do the following:

list_of_dfs = list(df1, df2, ..... , df25)
lapply(list_of_dfs, function(x) { x[, 6] <- as.integer(gsub("," , "", x[, 6])) })
SmitM
  • 1,366
  • 1
  • 8
  • 14
  • 1
    You need to assign new or old list object to `lapply` and return `x` inside function. – Parfait Sep 20 '18 at 15:01
  • This code has the desired effect on the data as it is displayed in the console but it doesn't then get saved to the data frame such that when I type View(df1) the data hasn't been updated. How could I get around this? – user9903833 Sep 20 '18 at 15:32
0

The data table way

test<-data.table(col1=c('100,00','100','100,000'),col2=c('90','80,00','60'))
    col1  col2
 100,00    90
 100      80,00
 100,000  60

your list of data frames

testList<-list(test,test)

assume u want to correct col2 in this case but want to use index as reference

removeNonnumeric<-function(x){return(as.numeric(gsub(',','',x)))}
data<-function(x){return(x[,lapply(.SD,removeNonnumeric),.SDcols=names(x)[2],by=col1])}

removeNonnumeirc removes the "," from the columns and data accesses each data table in the testList and calls "removeNonnumeric" on them output is a list of data tables which is created by merging these 2 functions in an "lapply"

 lapply(testList,data)
rahul
  • 561
  • 1
  • 5
  • 13