
Suppose www.exampleweb.com is a website that serves data like this:

...  
-3.7358293e+000
7.6062331e-001
6.0701401e+000
-1.6897975e+000
-2.1088811e+000
2.7172791e+000
-2.5477626e+000
...

The data is a single column with 1000 rows.
I'm obtaining it from this website in two ways:
1. Reading the page directly with readLines():

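# url() needs an explicit scheme such as http://
con <- url("http://www.exampleweb.com")
data_from_html <- readLines(con)  # one character string per line of the page
close(con)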

Now I need to convert the data, because it was read as character strings:

str(data_from_html)
chr [1:1000] " -2.9735888e+000" " -1.4757566e+000" "  8.6980880e-001" "  4.9502553e+000" ...  

So:

converted <- as.numeric(data_from_html)

2. Copying the whole page (Ctrl+A), pasting it into a .txt file, and saving it as "my_data.txt":

data_from_txt <- read.table("my_data.txt")  
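
A quick sanity check on the character-to-numeric step can rule out parsing problems before any summaries are compared. The sketch below uses only the four strings visible in the str() output above as a stand-in for the full 1000-row vector (an assumption, since the real page is not available); `sample_lines` and `converted_sample` are hypothetical names.

```r
# Stand-in for data_from_html: the four strings shown by str() in the question.
sample_lines <- c(" -2.9735888e+000", " -1.4757566e+000",
                  "  8.6980880e-001", "  4.9502553e+000")

# Convert the character strings to numeric values.
converted_sample <- as.numeric(sample_lines)

# Strings that cannot be parsed become NA (with a warning), so check for that.
anyNA(converted_sample)

# Print with more significant digits than the default display uses.
print(converted_sample, digits = 10)
```

Printing with a larger `digits` value only changes how the numbers are displayed; the stored values stay the same.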

Now, when I call summary() on the converted vector, I get:

summary(converted)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-16.2800  -1.5030  -0.0598  -0.1809   1.2220  13.0100   

But on the other hand:

summary(data_from_txt)
       V1          
 Min.   :-16.2789  
 1st Qu.: -1.5026  
 Median : -0.0598  
 Mean   : -0.1809  
 3rd Qu.:  1.2217  
 Max.   : 13.0112  

I can't decide which one is better, but I suspect some data is lost when converting from character to numeric, and I don't know how to prevent it. I even checked the head and tail of both variables, but they contain the same values:

head(converted)
[1] -2.9735888 -1.4757566  0.8698088  4.9502553 -4.3059115  0.9745958
tail(converted)
[1] -3.007217 -4.600345 -3.740255  2.579664 -2.233819 -1.028491

head(data_from_txt)
          V1
1 -2.9735888
2 -1.4757566
3  0.8698088
4  4.9502553
5 -4.3059115
6  0.9745958
tail(data_from_txt)
            V1
995  -3.007217
996  -4.600345
997  -3.740255
998   2.579664
999  -2.233819
1000 -1.028491  
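
One way to test the data-loss suspicion is to compare the stored numbers themselves rather than their printed summaries. The following is a minimal sketch, assuming the seven sample rows shown at the top of the question stand in for the full page and a temporary file stands in for my_data.txt; it also requests the same number of significant digits from both summary() calls through the `digits` argument.

```r
# Sample rows copied from the question; a stand-in for the full 1000-row page.
sample_lines <- c("-3.7358293e+000", "7.6062331e-001", "6.0701401e+000",
                  "-1.6897975e+000", "-2.1088811e+000", "2.7172791e+000",
                  "-2.5477626e+000")

# Route 1: character vector -> numeric.
converted <- as.numeric(sample_lines)

# Route 2: text file -> read.table(); a temporary file stands in for my_data.txt.
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)
data_from_txt <- read.table(tmp)

# Compare the stored values, not their printed form.
all.equal(converted, data_from_txt$V1)

# Ask both summaries for the same number of significant digits.
summary(converted, digits = 7)
summary(data_from_txt, digits = 7)

# Clean up the temporary file.
unlink(tmp)
```

If all.equal() reports TRUE, both routes hold the same numbers, which would point to the differing summaries being a matter of how many digits each print method displays rather than a loss during conversion.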

How should I deal with this? Does it mean I should never scrape data from the web? What if, for some reason, I can't create a .txt file? Maybe I just need a better method for data conversion?

Photon Light
  • Why not use `read.table("http://www.exampleweb.com")`? Or, if that doesn't work, `library(XML); readHTMLTable("http://www.exampleweb.com")`. There are many easy ways to get data from websites with R. Giving us an actual web page to look at would be better than giving a fake one. – Rich Scriven Oct 20 '14 at 17:05
  • Well, it works, thank you! I'm more of a math guy than a computer guy, so I'm still learning R and don't know every method yet. That was helpful. But anyway, I'm still curious where the devil is in the operations I posted. – Photon Light Oct 20 '14 at 17:13
  • I would have thought these would have been equivalent. Without a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that we can run to reproduce the same results, it's hard to say what might be going on. – MrFlick Oct 20 '14 at 17:18
