
I am trying to read a series of text files into R. The files appear to have the same format. Everything is fine except for one file: when I read it, R treats all the numbers as characters. I used as.numeric to convert them back, but the values changed. I also tried converting the text file to CSV and reading that in, but that did not work either. Has anyone had this problem before? How can I fix it? Thank you!

The data is from the Human Mortality Database. I cannot attach the data here due to copyright issues, but anyone can register with the HMD and download it (www.mortality.org). As an example, I used the Australian and Belgian 1x1 exposure data.

My code is as follows:

AUSe <- read.table("AUS.Exposures_1x1.txt", skip = 1, header = TRUE)[, -5]
BELe <- read.table("BEL.Exposures_1x1.txt", skip = 1, header = TRUE)[, -5]

Then I want to add some rows of the above data frame or matrix. This works fine for the Australian data (e.g., AUSe[1,3]+AUSe[2,3]), but the same command fails for the Belgian data: Error in BELe[1, 3] + BELe[2, 3] : non-numeric argument to binary operator. If you look at the text file, you can see those are two numbers. It is clear that R treated the numbers as characters when reading the file, which is rather odd.
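A minimal sketch of the likely mechanism, using made-up values rather than the actual HMD data: if a column contains even one non-numeric entry, read.table stores it as a factor, and as.numeric on a factor returns the underlying level codes instead of the printed values.

```r
# Hypothetical values standing in for one HMD column; the stray "."
# (the HMD's missing-data marker) makes the whole column non-numeric
x <- factor(c("61006.15", "55072.53", "."))

as.numeric(x)                # level codes, not the values -- the "changed" data
as.numeric(as.character(x))  # 61006.15 55072.53 NA -- go via character first
```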

LaTeXFan
  • `read.csv( ..., stringsAsFactors=FALSE)` (Edit: This jives with @josilber's comment) – Ari B. Friedman Dec 13 '13 at 21:47
  • What was in the text file you were converting to a csv? How did it not work? Please post examples of the problems you're having and what you've tried. – josliber Dec 13 '13 at 21:48
  • 1
    As hinted most likely you had some characters in the column. This will cause R to store it as factors initially. When using as.numeric on a factor you won't get the original numbers back - you'll get the factor level back. The given answer will allow you to read it in as character - at which point you should examine your data to see what values aren't 'actually numeric'. – Dason Dec 13 '13 at 21:52
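Building on that comment, here is a sketch of how to list the values that fail numeric coercion, run on a made-up fragment (with the real file you would pass the filename instead of text =):

```r
# Stand-in fragment containing the two culprits found in HMD files:
# the open age group "110+" and the missing-data marker "."
txt <- "Year Age Female Male Total
1841 0 61006.15 62948.23 123954.38
1842 110+ 0.00 0.00 0.00
1914 30 . . ."
BELe <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# For each column, the unique entries that are not actually numeric
bad <- lapply(BELe, function(col)
  unique(col[is.na(suppressWarnings(as.numeric(col)))]))
bad  # Age has "110+"; Female, Male and Total have "."
```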

3 Answers


Try this instead:

BELe <- read.table("BEL.Exposures_1x1.txt", skip = 1, colClasses = "numeric", header = TRUE)[, -5]

Or you could surely post just a tiny bit of that file without violating any copyright laws, at least in my jurisdiction (which I believe is the same one as the Human Mortality Database's).

Belgium, Exposure to risk (period 1x1)     Last modified: 04-Feb-2011, MPv5 (May07)

   Year      Age       Female          Male         Total
   1841        0        61006.15     62948.23    123954.38 
   1841        1        55072.53     56064.21    111136.73 
   1841        2        51480.76     52521.70    104002.46 
   1841        3        48750.57     49506.71     98257.28 
   ....         .        ....

So I might have suggested the even more accurate colClasses:

BELe<-read.table("BEL.Exposures_1x1.txt",skip=2,  #  really two lines to skip I think
                 colClasses=c(rep("integer", 2), rep("numeric",3)),
                 header=TRUE)[,-5]

I suspect the problem occurs because of lines like this one:

   1842      110+           0.00         0.00         0.00 

So you will need to decide how much interest you have in preserving the "110+" values. With my method they will be coerced to NAs. (Well, I thought they would be, but like you I got an error.) So this multi-step process is needed:

 BELe<-read.table("Exposures_1x1.txt",skip=2,
                  header=TRUE)
 BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.character)
 str(BELe)
#-------------
'data.frame':   18759 obs. of  5 variables:
 $ Year  : int  1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
 $ Age   : chr  "0" "1" "2" "3" ...
 $ Female: chr  "61006.15" "55072.53" "51480.76" "48750.57" ...
 $ Male  : chr  "62948.23" "56064.21" "52521.70" "49506.71" ...
 $ Total : chr  "123954.38" "111136.73" "104002.46" "98257.28" ...
#-------------
 BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.numeric)

#----------
Warning messages:
1: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
2: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
3: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
4: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
str(BELe)
#-----------
'data.frame':   18759 obs. of  5 variables:
 $ Year  : int  1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
 $ Age   : num  0 1 2 3 4 5 6 7 8 9 ...
 $ Female: num  61006 55073 51481 48751 47014 ...
 $ Male  : num  62948 56064 52522 49507 47862 ...
 $ Total : num  123954 111137 104002 98257 94876 ...
# and just to show that they are not really integers:
 BELe$Total[1:5]
#[1] 123954.38 111136.73 104002.46  98257.28  94875.89
IRTFM
  • Thank you for your response. But I still have a problem. One entry in the second column is "110+", which would cause an error. Therefore, I changed colClasses to integer, character and 3 numerics. Now I got the following error: Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : scan() expected 'a real', got '.' – LaTeXFan Dec 13 '13 at 22:27
  • Besides, I really cannot see the difference between AUS and BEL data. Why would R treat them differently, please? – LaTeXFan Dec 13 '13 at 22:31
  • Weird. I thought my colClasses approach would coerce to 'numeric'. Not sure why it didn't. Posting tested solution: – IRTFM Dec 13 '13 at 23:19
  • Thank you again. I still think there should be an easier way to handle this. – LaTeXFan Dec 14 '13 at 00:07
  • Well, I _thought_ there was a simple way. There certainly used to be a way with colClasses and that is certainly how I read the manual. I'm thinking a bug has been introduced in `read.table`. – IRTFM Dec 14 '13 at 00:19
  • Another confusing question is why Belgium file is different from other countries' files. The basic function works well for other countries. Any idea on this? Thanks. – LaTeXFan Dec 14 '13 at 00:23
  • No ideas. I just posted a question to R-help about why the colClasses spec was not working. Furthermore, I just tried with a fragment of the AUS data and get the same error. – IRTFM Dec 14 '13 at 00:52
  • For AUS data, all you need is read.table(file). This way the exposure data is imported as numbers without using colClass argument. – LaTeXFan Dec 14 '13 at 01:38
  • just specify `na.strings = "."` in the `read.table()` arguments, and either `stringsAsFactors = TRUE` or `as.is = TRUE` and most of this (except the 110+) goes away. I put this in an answer. – tim riffe Dec 17 '13 at 17:25
  • Perhaps you meant `stringsAsFactors=FALSE`? – IRTFM Dec 17 '13 at 18:09

The way I typically read those files is:

BELexp <- read.table("BEL.Exposures_1x1.txt", skip = 2, header = TRUE, na.strings = ".", as.is = TRUE)

Note that Belgium lost 3 years of data in WWI that may never be recovered, and hence these three years are all NAs, which in those files are marked with ".", a character string. Hence the argument na.strings = ".". Specifying that argument will take care of all columns except Age, which is character (intentionally), due to the "110+". The reason the HMD does this is so that users have to be intentional about treatment of the open age group. You can convert the age column to integer using:

BELexp$Age <- as.integer(gsub("[+]", "", BELexp$Age))
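For example, on a toy Age vector (values made up, matching the HMD's 0 to 110+ age range):

```r
Age <- c("0", "1", "109", "110+")
as.integer(gsub("[+]", "", Age))  # 0 1 109 110
```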

Since such issues have long been the bane of R-HMD users, the HMD has recently posted some R functions in a small but growing package on GitHub called (for now) DemogBerkeley. The function readHMD() removes all of the above headaches:

library(devtools)
install_github("DemogBerkeley", subdir = "DemogBerkeley", username = "UCBdemography")

BELexp <- readHMD("BEL.Exposures_1x1.txt")

Note that a new indicator column, called OpenInterval is added, while Age is converted to integer as above.

tim riffe

Can you try `read.csv(..., stringsAsFactors=FALSE)`?
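A sketch of what that would look like on a made-up CSV fragment (read.csv also accepts text =; with the real file you would pass the filename):

```r
csv <- "Year,Age,Female
1841,0,61006.15
1914,30,."
BELe <- read.csv(text = csv, stringsAsFactors = FALSE)

# The Female column comes in as character because of the "."; coerce it
# explicitly -- the "." becomes NA (with a coercion warning)
BELe$Female <- suppressWarnings(as.numeric(BELe$Female))
str(BELe)
```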

Wilmer E. Henao
  • I think there is a similar question. Maybe your answer is there? http://stackoverflow.com/questions/13706188/importing-csv-file-into-r-numeric-values-read-as-characters – Wilmer E. Henao Dec 13 '13 at 22:36