
I have a notepad txt file called inflation.txt.

The file has two columns (delimited with a "space") and looks something like this:

1950-1 0.0084490544865279
1950-2 −0.0050487986543660
1950-3 0.0038461526886055
1950-4 0.0214293914558992
1951-1 0.0232839389540449
1951-2 0.0299121323429455
1951-3 0.0379293285389640
1951-4 0.0212773984472849

I am trying to import this file into R.

After reading a previous Stack Overflow post, "Reading text file with multiple space as delimiter in R", I adapted its code to my problem:

data <- read.table("inflation.txt", sep = "" , header = F ,
                   na.strings ="", stringsAsFactors= F)

But when I run the above code, an unwanted character ("−") appears:

> head(data)

      V1                    V2
1 1950-1    0.0084490544865279
2 1950-2 −0.0050487986543660
3 1950-3    0.0038461526886055
4 1950-4    0.0214293914558992
5 1951-1    0.0232839389540449
6 1951-2    0.0299121323429455

Can someone please show me what I am doing wrong? Is the data getting corrupted? Is there a way to fix this problem?

desertnaut
stats_noob

2 Answers


What do you get if you try this?

data <- read.table("inflation.txt", sep = "" , header = F ,
                   na.strings ="", stringsAsFactors= F, encoding = "UTF-8")

That weird character looks like a UTF-8 symbol.
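
One way to check that suspicion is to scan the lines for characters outside the ASCII range. This is only a sketch: the `lines` vector below is made-up sample data standing in for the result of `readLines("inflation.txt", encoding = "UTF-8")`, and `has_non_ascii` is a name chosen for illustration.

```r
# Sample lines standing in for the file contents (assumption: the real
# file would be read with readLines("inflation.txt", encoding = "UTF-8"))
lines <- c("1950-1 0.0084490544865279",
           "1950-2 \u22120.0050487986543660",
           "1950-3 0.0038461526886055")

# Flag every line containing a code point above 127 (i.e. non-ASCII)
has_non_ascii <- vapply(lines,
                        function(s) any(utf8ToInt(s) > 127L),
                        logical(1), USE.NAMES = FALSE)
which(has_non_ascii)

# List the distinct non-ASCII code points found in the data
code_points <- unique(unlist(lapply(lines, function(s) {
  cp <- utf8ToInt(s)
  cp[cp > 127L]
})))
code_points
```

Any code point reported here marks a character that `read.table` may not treat as part of a number.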

UseR10085
rdodhia

The minus sign in the file isn't - (the ASCII hyphen-minus), it's − (the Unicode minus sign).
You can compare the character code using this link.
For -, you get ASCII 45, which corresponds to Unicode U+002D; for the character above you get 8722, which corresponds to Unicode U+2212.
Both look like minus signs, but read.table only accepts the first one as the sign of a number.
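
The two code points can also be checked directly in R. A small sketch (the variable names `hyphen` and `minus` are just for illustration):

```r
hyphen <- "-"       # ASCII hyphen-minus, U+002D
minus  <- "\u2212"  # Unicode minus sign, U+2212

# Code points of the two characters
utf8ToInt(hyphen)
utf8ToInt(minus)

# Only the ASCII version is parsed as a negative number;
# the Unicode version cannot be coerced and gives NA (with a warning)
ok  <- as.numeric(paste0(hyphen, "0.5"))
bad <- suppressWarnings(as.numeric(paste0(minus, "0.5")))
```

Here `ok` is a negative number, while `bad` is `NA`, which is why the column is not read as numeric.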

You could replace the wrong character before parsing:

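# Read the raw lines (assumes the file is UTF-8 encoded)
file <- readLines('inflation.txt', encoding = "UTF-8")
# Replace the Unicode minus sign (U+2212) with the ASCII hyphen-minus
file <- gsub("\u2212", "-", file, fixed = TRUE)

data <- read.table(text = file, sep = "", header = FALSE,
                   na.strings = "", stringsAsFactors = FALSE)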

head(data) 
      V1           V2
1 1950-1  0.008449054
2 1950-2 -0.005048799
3 1950-3  0.003846153
4 1950-4  0.021429391
5 1951-1  0.023283939
6 1951-2  0.029912132 
Waldi
  • Thank you for your answer! I tried the answer above and all negative values are replaced with NA. Do you know how to fix this? – stats_noob Feb 20 '21 at 22:43
  • thank you for your reply. I tried to convert the "V2" column as a "numeric" but it does not seem to be working. I posted a related question over here: https://stackoverflow.com/questions/66299281/r-na-introduced-by-coercion . Can you please take a look at it if you have time? Thank you for all your help - I really appreciate it, – stats_noob Feb 21 '21 at 04:57
  • @Noob, on my computer, `class(data$V2)` is `numeric`. You could try the `dec='.'` argument in `read.table` – Waldi Feb 21 '21 at 07:22
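
For the "NA introduced by coercion" case raised in the comments, a possible approach is to fix the column after import instead of fixing the file. The sketch below assumes `V2` was read as character because it still contains U+2212; the data frame is built by hand here to stand in for the output of `read.table`.

```r
# Hand-built stand-in for what read.table returns when the file
# still contains the Unicode minus sign: V2 is character, not numeric
data <- data.frame(
  V1 = c("1950-1", "1950-2", "1950-3"),
  V2 = c("0.0084490544865279",
         "\u22120.0050487986543660",
         "0.0038461526886055"),
  stringsAsFactors = FALSE
)
class(data$V2)

# Swap U+2212 for the ASCII hyphen-minus, then convert to numeric
data$V2 <- as.numeric(gsub("\u2212", "-", data$V2, fixed = TRUE))
class(data$V2)
anyNA(data$V2)
```

Calling `as.numeric` on the column without the `gsub` step would turn every negative value into `NA`, since the Unicode minus sign is not recognised as a sign.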