0

i have about 30 columns within a dataframe of over 100 columns. the file i am reading in stores its numbers as characters. In other words 1300 is 1,300 and R thinks it is a character.

I am trying to fix that issue by replacing the "," with nothing and turn the field into an integer. I do not want to use gsub on each column that has the issue. I would rather store the columns as a vector that have the issue and do one function or loop with all the columns.

I have tried using lapply, but am not sure what to put as the "x" variable.

Here is my function with the error below it

ItemStats_2014[intColList] <- lapply(ItemStats_2014[intColList],
    as.integer(gsub(",", "", ItemStats_2014[intColList])) )

Error in [.data.table(ItemStats_2014, intColList) : When i is a data.table (or character vector), the columns to join by must be specified either using 'on=' argument (see ?data.table) or by keying x (i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

smci
  • 32,567
  • 20
  • 113
  • 146
alexb523
  • 718
  • 2
  • 9
  • 26
  • Just don't read in integers as strings in the first place, read them in as integers; `fread()` understands comma as decimal-separator – smci Apr 05 '18 at 20:36
  • For the future, as far as `lapply` itself goes, you can define the function in the `lapply` call like this: `lapply(ItemStats_2014[intColList], function(x) as.integer(gsub(",", "", x)))` – Edward Carney Apr 05 '18 at 20:42
  • Hey, I gave you the correct programmatic answer, including how to programmatically find colClasses without pain, even on 100-column dfs. This is scalable and does not leak memory and waste CPU unlike unnecessarily reading integers in as strings then converting. Look at gc() to see how much memory you waste on number strings. – smci Apr 05 '18 at 21:59

2 Answers2

3

The file I am reading in stores its numbers as characters [with commas as decimal separator]

Just directly read those columns in as decimal, not as string: data.table::fread() understands decimal separators: dec=',' by default.

You might need to play with fread(..., colClasses=(...) ) argument a bit to specify the integer columns:

myColClasses <- rep('string',100) # for example...
myColClasses[intColList] <- 'integer'
# ...any other colClass fixup as needed...
ItemStats_2014 <- fread('your.csv', colClasses=myColClasses)

This approach is simpler and faster and uses much less memory than reading as string, then converting later.

smci
  • 32,567
  • 20
  • 113
  • 146
  • this is good intel, and would probably work and i like the answer, but there is a ton of columns that i would spend quite a big of time sorting that out. – alexb523 Apr 05 '18 at 21:01
  • 1
    @alexb523: so just do a trial `fread(..., nrows=100)`. Then use `myColClasses <- sapply(df, class)` to get a named vector of the colClasses. You can use `as.vector()` and `names()` to get the components respectively. Fix it up as desired, put it in your code, then reread with `fread('your.csv', colClasses=myColClasses)`. Trust me it will make your life sane. It won't take time and you can do it programmatically. (Btw, CSV columns typically have naming conventions such that you can deduce their types from their names, with a `grep` or `grepl` or `gsub()`) – smci Apr 05 '18 at 21:50
  • There's no 'probably', I've been doing this for years. Just learn how to do it once and never look back. – smci Apr 05 '18 at 21:54
  • 1
    Helpful: [Programmatically determine the data types of a data frame's columns](https://stackoverflow.com/questions/21125222/programmatically-determine-the-data-types-of-a-data-frames-columns) – smci Apr 05 '18 at 22:04
  • thanks! i have been looking for away to call out just the column name and class. works great. – alexb523 Apr 05 '18 at 22:27
1

Try using dplyr::mutate_at() to select multiple columns and apply a transformation to them.

ItemStats_2014 <- ItemStats_2014 %>%
    mutate_at(intColList, funs(as.integer(gsub(',', '', .))))

mutate_at selects columns from a list or using a dplyr selector function (see ?select_helpers) then applies one or more functions to each column. The . in gsub refers to each selected column that mutate_at passes to it. You can think of it as the x in function(x) ....

divibisan
  • 11,659
  • 11
  • 40
  • 58