
Edit: Per Parfait's recommendation, I found success by specifying ISO-8859-1 encoding instead of UTF_8.

I'm reading in IEEE article metadata & abstracts.

I'm looping through multiple pages of the results. My code has been working well, but then this bit caused the error below:

require(XML)
link <- "http://ieeexplore.ieee.org/gateway/ipsSearch.jsp?py=1934&hc=100&rs=1"
doc <- xmlParse(link, encoding = "UTF_8", options = NOCDATA)

Error:

input conversion failed due to input error, bytes 0x20 0x62 0x65 0x66
encoder errorCData section not finished
Discussion on ¿The measurement of noise, with s
Premature end of data in tag title line 3081
Premature end of data in tag document line 3077
Premature end of data in tag root line 3
Error: 1: input conversion failed due to input error, bytes 0x20 0x62 0x65 0x66
2: encoder error3: CData section not finished
Discussion on ¿The measurement of noise, with s
4: Premature end of data in tag title line 3081
5: Premature end of data in tag document line 3077
6: Premature end of data in tag root line 3
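
To make the message easier to read, here is a small base-R sketch. The first part decodes the four bytes quoted in the error. The second part uses an invented sample string (not taken from the feed) to show how text stored in a single-byte encoding fails when it is treated as UTF-8; it assumes the gateway response contains such bytes, which the output above suggests but does not prove.

```r
# The four bytes named in the error message are plain ASCII characters.
reported <- rawToChar(as.raw(c(0x20, 0x62, 0x65, 0x66)))
print(reported)

# Invented sample: "caf" followed by byte 0xE9, which is e-acute in Latin-1
# but is not a valid byte sequence in UTF-8.
sample_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# iconv() returns NA when the input is not valid in the declared source encoding.
as_utf8   <- iconv(sample_text, from = "UTF-8",  to = "UTF-8")
as_latin1 <- iconv(sample_text, from = "latin1", to = "UTF-8")

print(is.na(as_utf8))   # TRUE: the bytes are not valid UTF-8
print(as_latin1)
```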

I ran into this same error earlier with this dataset and worked around it by requesting fewer records per call (hc=100 instead of hc=1000).
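
A minimal sketch of how the smaller requests could be generated. `build_page_urls` and its `total` argument are hypothetical names, and the sketch assumes that `hc` is the number of records per request and `rs` is the 1-based index of the first record, as the example URL suggests.

```r
# Hypothetical helper: one gateway URL per batch of `page_size` records.
# Assumes hc = records per request and rs = 1-based start record.
build_page_urls <- function(year, total, page_size = 100) {
  starts <- seq(1, total, by = page_size)
  sprintf("http://ieeexplore.ieee.org/gateway/ipsSearch.jsp?py=%d&hc=%d&rs=%d",
          year, page_size, starts)
}

# Three requests would cover 250 records: rs = 1, 101, 201.
urls <- build_page_urls(1934, total = 250)
print(urls)
```

Each element of `urls` can then be passed to `xmlParse()` in a loop, so a failure on one batch does not lose the others.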

The gateway query parameters are listed here: http://ieeexplore.ieee.org/gateway/

Any ideas why this error happens and what I can do to work around it?

Session info:

R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.8.3   XML_3.98-1.3

loaded via a namespace (and not attached):
[1] slidify_0.4.5  markdown_0.7.7 tools_3.2.1    whisker_0.3-2  yaml_2.1.13    Rcpp_0.12.1   
[7] knitr_1.11     stringr_1.0.0 

Thank you for your help!

PatrickB
  • Try a different encoding specification, such as `ISO-8859-1`. See the exhaustive [list](http://www.iana.org/assignments/character-sets/character-sets.xml). – Parfait Sep 14 '15 at 20:06
  • Just ran your code, pointing to the web page. Zero errors. I use R version 3.1.0, Windows 64-bit. My `sessionInfo()` matches yours, except that plyr and those namespaces are not loaded. Interestingly, my locale is not `en_US.UTF-8` but `LC_COLLATE=English_United States.1252`. See this [SO post](http://stackoverflow.com/questions/20577764/set-locale-to-system-default-utf-8). – Parfait Sep 15 '15 at 02:34
  • @Parfait I had tried using one or two other encoding specs, but apparently not ISO-8859-1. I popped that in and it worked. So far, that's successfully processed many more records without errors. Would be nice if the gateway's source listed the correct encoding method. Thank you for your help! – PatrickB Sep 15 '15 at 10:59
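
The fix discussed in these comments could be wrapped in a small fallback, sketched here. `parse_with_fallback` is a hypothetical helper, not part of the XML package. It relies on `xmlParse()` raising an R error when conversion fails, as in the output in the question. The `parser` argument exists only so that the logic can be tried without network access.

```r
# Hypothetical helper: try each declared encoding in turn and keep the
# first one that parses without raising an error.
parse_with_fallback <- function(src,
                                encodings = c("UTF-8", "ISO-8859-1"),
                                parser = XML::xmlParse) {
  for (enc in encodings) {
    doc <- tryCatch(parser(src, encoding = enc), error = function(e) NULL)
    if (!is.null(doc)) {
      return(list(doc = doc, encoding = enc))
    }
  }
  stop("could not parse ", src, " with any of: ",
       paste(encodings, collapse = ", "))
}

# Offline demonstration with a stand-in parser that rejects UTF-8.
stub_parser <- function(src, encoding) {
  if (encoding == "UTF-8") stop("input conversion failed")
  paste(src, encoding)
}
res <- parse_with_fallback("page-1", parser = stub_parser)
print(res$encoding)

# With the real parser (requires the XML package and network access):
# res <- parse_with_fallback(link)
```

One caveat: a single-byte encoding such as ISO-8859-1 accepts almost any byte sequence, so a successful parse does not prove the encoding is right. It is worth spot-checking a few titles that contain accented or typographic characters.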
