I would like to extract data from this text blob. It contains both tab-delimited text and XML-tagged text. I would like to pull out each XML blob and parse it separately for my analysis.

Text1   Text2   text3   text4   text4   <Assessment>
  <Questions>
    <Question>
      <Id>1</Id>
      <Key>Instructions</Key>
      <QuestionText>Your Age</QuestionText>
      <QuestionType>Label</QuestionType>
      <Answer>16-30</Answer>
    </Question>
  </Questions>
</Assessment>   text5
Text1   Text2   text3   text4   text4   <Assessment>
  <Questions>
    <Question>
      <Id>1</Id>
      <Key>Instructions</Key>
      <QuestionText>Your Age</QuestionText>
      <QuestionType>Label</QuestionType>
      <Answer>31-49</Answer>
    </Question>
  </Questions>
</Assessment>   text5

I read the text with readLines and then did the following:

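# Strip leading whitespace from every line
tst <- gsub("^\\s+", "", tst)
# Reduce lines holding the opening tag to the bare tag (drops the tab-delimited prefix)
idx <- grep("<Assessment>", tst, fixed = TRUE)
tst[idx] <- "<Assessment>"
# Reduce lines holding the closing tag to the bare tag (drops the trailing text)
idx <- grep("</Assessment>", tst, fixed = TRUE)
tst[idx] <- "</Assessment>"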

I still haven't figured out how to parse the result using the XML package.

JeanVuda
  • Please see [how to make a great R reproducible question](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). You have posted no code. You are—in essence—asking for Code As A Service. That is not what SO is for. What have you tried? – hrbrmstr Dec 15 '15 at 01:56

1 Answer

You may want to try getNodeSet from the XML package: http://www.inside-r.org/packages/cran/xml/docs/matchNamespaces
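As a rough sketch of how that could look (not necessarily the only or best way): collapse the lines into one string, pull out each <Assessment>...</Assessment> block with a non-greedy regex, then parse each block with xmlParse and query it with getNodeSet. The sample vector below is typed in by hand to mirror the question and assumes the fields really are tab-separated; in practice tst would come from readLines on your own file. The XPath "//Question/Answer" is just one example query.

```r
library(XML)  # assumes the XML package is installed

# Hand-typed sample mirroring the question (assumption: fields are tab-separated).
# In practice: tst <- readLines("your_file.txt")  # hypothetical file name
record <- function(answer) c(
  "Text1\tText2\ttext3\ttext4\ttext4\t<Assessment>",
  "  <Questions>",
  "    <Question>",
  "      <Id>1</Id>",
  "      <Key>Instructions</Key>",
  "      <QuestionText>Your Age</QuestionText>",
  "      <QuestionType>Label</QuestionType>",
  paste0("      <Answer>", answer, "</Answer>"),
  "    </Question>",
  "  </Questions>",
  "</Assessment>\ttext5"
)
tst <- c(record("16-30"), record("31-49"))

# Collapse to a single string so a blob can be matched across line breaks.
txt <- paste(tst, collapse = "\n")

# Non-greedy match of every <Assessment>...</Assessment> block;
# (?s) lets "." match newlines (needs perl = TRUE).
blobs <- regmatches(
  txt,
  gregexpr("(?s)<Assessment>.*?</Assessment>", txt, perl = TRUE)
)[[1]]

# Parse each blob on its own and extract the first <Answer> via XPath.
answers <- vapply(blobs, function(b) {
  doc <- xmlParse(b, asText = TRUE)
  xmlValue(getNodeSet(doc, "//Question/Answer")[[1]])
}, character(1), USE.NAMES = FALSE)

print(answers)
```

Each element of blobs is a standalone XML document, so the same loop can collect other fields (for example "//Question/QuestionText") by changing the XPath.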

pidig89