1

I have a UTF-16 Unicode Text (.txt) file downloaded and defaulted as comma-separated values (.csv) when saved on a mac drive. This file contains numeric data that has 1000 separator applied for numbers that are greater than 1,000. When loading in R, this data is in character class. In order to convert to numeric class, I do the following:

tx <- read.table("/Users/username/Desktop/report.csv",sep="\t", dec = ".", fileEncoding = "UTF-16LE", fill = T, skip=1 , quote="", header=T, stringsAsFactors = FALSE)

tx$Cost <- gsub("\\,", replacement = "", x = tx$Cost)

tx$Cost <- as.numeric(tx$Cost)
Warning message:
NAs introduced by coercion 

When summarized using the head(subset()) function, the following is the result where I'm still not able to convert into numeric class:

       **Orig after_gsub as.numeric**
1      95.31      95.31      95.31
2     992.77     992.77     992.77
3 "1,719.68"  "1719.68"         NA
4 "3,135.79"  "3135.79"         NA
5     111.91     111.91     111.91
6     305.12     305.12     305.12

Can someone help me convert them into numeric class using as.numeric()?

ed0reddie
  • 143
  • 2
  • 13
  • I do not get any error when I take the same steps after defining: tx=data.frame(Cost=c('100','100.1','1,999.5',NA)) – Naisheel Verdhan Aug 01 '14 at 20:35
  • @BondedDust: I'm not exactly sure how to apply colClasses in my situation, could you give me a hint specific to this situation? I can test and provide feedback. Thanks all! – ed0reddie Aug 02 '14 at 00:18
  • @beginneR: the combined as.numeric(gsub()) function doesn't seem to fix. :( – ed0reddie Aug 02 '14 at 00:21
  • @eN-V: if you load a csv with 1000 separator formatting applied to a column, you might be able to duplicate the scenario; instead of assigning a df with these values on GUI. Hopefully this helps you duplicate! – ed0reddie Aug 02 '14 at 00:22
  • Why not put up these data, or a sample of it and let us try it rather than guess! – Brandon Bertelsen Aug 02 '14 at 00:32

3 Answers3

3

The worked example using setClass, setAs and colClasses:

 library(methods)
  setClass("chr.w.commas", contains=numeric())
  setAs("character", "chr.w.commas", function(from) 
                              as.numeric(gsub("\\,", "",from )) )
 dat <- read.table(text="Orig after_gsub num
 1      '95.31'      '95.31'      '95.31'
 2     992.77     992.77     992.77
 3 '1,719.68'  '1719.68' NA
 4 '3,135.79'  '3135.79' NA
 5     111.91 111.91 111.91
 6     305.12     305.12     305.12", header=TRUE, colClasses="chr.w.commas")
 str(dat)
'data.frame':   6 obs. of  3 variables:
 $ Orig      : num  95.3 992.8 1719.7 3135.8 111.9 ...
 $ after_gsub: num  95.3 992.8 1719.7 3135.8 111.9 ...
 $ num       : num  95.3 992.8 NA NA 111.9 ...
IRTFM
  • 258,963
  • 21
  • 364
  • 487
1

Thank you everyone who helped here. I actually found that my load function was the problem, and the following code does a simple trick to read in data correctly from the start.

read.csv(filename, sep="\t", fileEncoding="UTF-16", skip=1)    
ed0reddie
  • 143
  • 2
  • 13
0

I suspect that gsub doesn't work right on your UTF-16 strings. Perhaps you should convert the strings before doing the substitution. Try the following:

tx <- read.table("/Users/username/Desktop/report.csv",sep="\t", dec = ".", fileEncoding = "UTF-16LE", fill = T, skip=1 , quote="", header=T, stringsAsFactors = FALSE)
tx$Cost <- iconv(tx$Cost,"UTF-16","ASCII",sub='')
tx$Cost <- gsub("\\,", replacement = "", x = tx$Cost)
tx$Cost <- as.numeric(tx$Cost)
farnsy
  • 2,282
  • 19
  • 22
  • I tried converting using iconv() as you put it. Originally, I get this [1] "95.31" "992.77" "\"1,719.68\"" "\"3,135.79\"" "111.91" "305.12"; but after the conversion, I get this [1] "" "" "" "" "" "". Am I doing something wrong? – ed0reddie Aug 02 '14 at 00:14
  • does `iconv(tx$Cost,"UTF-8","ASCII",sub='')` work? Alternately, skip the `iconv()` and specify `encoding="ASCII"` in your `read.table()` so that it gets re-encoded? – farnsy Aug 02 '14 at 00:27
  • `iconv(tx$Cost,"UTF-8","ASCII",sub='')` still introduces NAs by coercion; and fileEncoding = "ASCII" leads to error in read.table: no lines available in input.. – ed0reddie Aug 02 '14 at 00:36
  • Bummer. If this file is on the public web, why not share the link and we will work directly on it? – farnsy Aug 02 '14 at 00:52