I have a bunch of .xml files with nodes that are causing unnecessary complications. I would like to remove these nodes but ensure that their children are preserved (not the hierarchical structure, just the data). Eventually I want to take the data from each .xml file and build a data frame. It seems like xmlTreeParse along with xmlToList will help, but the latter only works well with a flat structure. I have played around with unlisting the output from xmlToList and then converting it to a data frame, but the output is a bit funky.

I thought about simply writing a function to go through all the files and delete all the tags that I don't want; however, I don't know how to do this in R.

Any suggestions?

scottyaz
  • It would probably help if you supplied examples of the xml before and after the requested changes. – Brian Scott Jun 27 '10 at 12:44
  • Here is an extract of the xml I am starting with: SWES_20.0.22010-06-26T18:19:02.5982010-06-26T18:21:11.742Melissa32010-06-261dzemeni26846568560 – scottyaz Jun 27 '10 at 12:56
  • I want to simply remove the <poop> tags – scottyaz Jun 27 '10 at 12:57
  • May be opening a connection to each file and using gsub("",,files)? – scottyaz Jun 27 '10 at 12:58
  • `gsub` is definitely NOT the thing you need. Substitution of "" or "" tags with "" will not remove text within ""s. You need an XML parser, e.g. the XML package available on CRAN. Sadly, I haven't got much experience with R's XML parsing features. – aL3xa Jun 27 '10 at 17:05

2 Answers

It's simple to do in XSLT. Add this to the identity transform:

<xsl:template match="poop">
   <xsl:apply-templates select="node()"/>
</xsl:template>

Using regular expressions on XML hastens the coming of the Elder Gods and is not recommended.

Robert Rossney

See if this is what you are looking for. You can use the XML package from CRAN to parse XML documents. The following approach extracts only the <poop> elements:

library(XML)
# filename is the path to one of your .xml files
me <- xmlTreeParse(filename, useInternalNodes = TRUE)
pooptags <- xpathApply(me, "//poop")

pooptags will contain the following information:

<poop>
  <P3a_Village1>dzemeni</P3a_Village1>
  <P4_HousholdNumber/>
  <P5_VisitNumber>2</P5_VisitNumber>
</poop> 

You can combine this with the <?xml version='1.0' ?> declaration using R's paste command and write the result to a new, reduced file. Alternatively, you can extract further information, such as P3a_Village1, from the XML file with xpathApply like this:

village <- xpathApply(me, "//poop/P3a_Village1")
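
Since the goal is the data rather than the hierarchy, one option is to collect every leaf element directly instead of extracting one tag at a time. The following is only a minimal sketch, assuming the XML package is installed. The sample document (including the `<survey>` and `<interviewer>` elements) and the `flatten_xml` helper are hypothetical and made up for illustration; they are not from the question.

```r
library(XML)

# Hypothetical sample input: only the <poop> block mirrors the question;
# <survey> and <interviewer> are invented for illustration.
xml_text <- "<?xml version='1.0'?>
<survey>
  <interviewer>Melissa</interviewer>
  <poop>
    <P3a_Village1>dzemeni</P3a_Village1>
    <P4_HousholdNumber/>
    <P5_VisitNumber>2</P5_VisitNumber>
  </poop>
</survey>"

# Hypothetical helper: returns a one-row data frame with one column
# per leaf element, ignoring how deeply each leaf is nested.
flatten_xml <- function(doc) {
  # Leaf elements are those that have no child elements.
  leaves <- getNodeSet(doc, "//*[not(*)]")
  values <- lapply(leaves, function(node) {
    value <- xmlValue(node)
    # Treat empty elements as missing values.
    if (length(value) == 0 || !nzchar(value)) NA_character_ else value
  })
  names(values) <- vapply(leaves, xmlName, character(1))
  as.data.frame(values, stringsAsFactors = FALSE)
}

doc <- xmlParse(xml_text, asText = TRUE)
flat <- flatten_xml(doc)
print(flat)
```

Because wrapper elements such as `<poop>` have child elements, they never show up as columns, so nothing has to be deleted from the files first. Every value comes back as a character string, and leaf names that repeat within a file would need extra handling before rows from several files can be combined.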

I hope the solution is what you are looking for. Please let me know if it helps.

Shreyas Karnik
  • Thanks for the help. I think this would be a long way to do it, so I decided to use an XSLT script. Oh well... – scottyaz Jun 28 '10 at 01:42