I have a huge (85GB) XML-file with various data on cars in Denmark, from which I need to extract some data (not all). There is much more info in the actual file, but a sample (roughly translated) of the file is presented here.

<?xml version="1.0" encoding="UTF-8"?>
<ns:ESStatistikListeModtag_I xmlns:ns="http://skat.dk/dmr/2007/05/31/">
  <ns:Statistic>
    <ns:VehicleType>Personbil</ns:VehicleType>
    <ns:RegNo>XX12345</ns:RegNo>
    <ns:VehicleInfo>
      <ns:VehicleMake>AUDI</ns:VehicleMake>
      <ns:VehicleModel>Q7</ns:VehicleModel>
    </ns:VehicleInfo>
    <ns:VehicleInspection>
      <ns:InspectionDate>2000-05-31+02:00</ns:InspectionDate>
      <ns:InspectionResult>Approved</ns:InspectionResult>
    </ns:VehicleInspection>
  </ns:Statistic>
  <ns:Statistic>
    <ns:VehicleType>Personbil</ns:VehicleType>
    <ns:RegNo>YY54321</ns:RegNo>
    <ns:VehicleInfo>
      <ns:VehicleMake>RENAULT</ns:VehicleMake>
      <ns:VehicleModel>CLIO</ns:VehicleModel>
    </ns:VehicleInfo>
    <ns:VehicleInspection>
      <ns:InspectionDate>2008-11-31+02:00</ns:InspectionDate>
      <ns:InspectionResult>Approved</ns:InspectionResult>
      <ns:InspectionKm>310</ns:InspectionKm>
    </ns:VehicleInspection>
  </ns:Statistic>
  <ns:Statistic>
    <ns:VehicleType>Van</ns:VehicleType>
    <ns:RegNo>QQ78901</ns:RegNo>
    <ns:VehicleInfo>
      <ns:VehicleMake>AUDI</ns:VehicleMake>
      <ns:VehicleModel>Q3</ns:VehicleModel>
    </ns:VehicleInfo>
    <ns:VehicleInspection>
      <ns:InspectionDate>2010-10-08+02:00</ns:InspectionDate>
      <ns:InspectionResult>Approved</ns:InspectionResult>
      <ns:InspectionKm>78</ns:InspectionKm>
    </ns:VehicleInspection>
  </ns:Statistic>
</ns:ESStatistikListeModtag_I>

I have looked at various questions, but my limited XML skills make it hard to handle the namespace prefix in front of every node. I looked especially at answers like the one from Martin Morgan at Combine values in huge XML-files.

For entries that have an InspectionKm value, I want to extract the registration number (RegNo) as an id, together with, for example, the vehicle make (VehicleMake) and the inspection kilometers (InspectionKm).

Can anyone explain how to use xmlEventParse to extract the relevant info?

2 Answers

I don't know about xmlEventParse, but if you're prepared to consider different technology, you can do this in a streaming XSLT 3.0 transformation as:

<xsl:transform version="3.0" 
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               xpath-default-namespace="http://skat.dk/dmr/2007/05/31/">
<xsl:mode streamable="yes" on-no-match="shallow-skip"/>
<xsl:template match="Statistic" >
  <xsl:variable name="this" select="copy-of(.)"/>
  <xsl:if test="exists($this//InspectionKm)">
    <out regno="{$this/RegNo}" make="{$this/VehicleInfo/VehicleMake}" km="{$this//InspectionKm}"/>
  </xsl:if>
</xsl:template>
</xsl:transform>

As a first guess, I would expect it to take an hour or so.

Michael Kay

Here is my approach using the xml2 package. Note that xml2 reads the whole document into memory, so given the size of your file (85GB) I am not sure what performance or memory limitations you may encounter.

library(xml2)
library(dplyr)

#read the file ("cars.xml" is a placeholder path, not taken from the question)
file <- read_xml("cars.xml")

#get namespace
ns <- xml_ns(file)

#find parent nodes which contain all requested information
statistic <- xml_find_all(file, ".//ns:Statistic", ns)

#get requested information from each node (NA where the element is missing)
regno <- xml_find_first(statistic, ".//ns:RegNo", ns) %>% xml_text()
make <- xml_find_first(statistic, ".//ns:VehicleMake", ns) %>% xml_text()
km <- xml_find_first(statistic, ".//ns:InspectionKm", ns) %>% xml_text()

#merge into a final dataframe, keeping only entries that have an InspectionKm value
answer <- data.frame(regno, make, km) %>% filter(!is.na(km))
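
Since only entries with an InspectionKm value are wanted, another option is to let XPath do the filtering with a predicate, so that nodes without that element are never selected. Below is a minimal sketch of that idea. It assumes the xml2 package is installed, and it runs on a small inline sample modelled on the data in the question (the trimmed records and the variable names sample_xml, with_km and result are only for illustration). It does not address the memory concern for the full file.

```r
library(xml2)

# Small inline sample modelled on the question's data:
# the first record has no InspectionKm, the second one does.
sample_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<ns:ESStatistikListeModtag_I xmlns:ns="http://skat.dk/dmr/2007/05/31/">
  <ns:Statistic>
    <ns:RegNo>XX12345</ns:RegNo>
    <ns:VehicleInfo><ns:VehicleMake>AUDI</ns:VehicleMake></ns:VehicleInfo>
    <ns:VehicleInspection>
      <ns:InspectionResult>Approved</ns:InspectionResult>
    </ns:VehicleInspection>
  </ns:Statistic>
  <ns:Statistic>
    <ns:RegNo>YY54321</ns:RegNo>
    <ns:VehicleInfo><ns:VehicleMake>RENAULT</ns:VehicleMake></ns:VehicleInfo>
    <ns:VehicleInspection>
      <ns:InspectionResult>Approved</ns:InspectionResult>
      <ns:InspectionKm>310</ns:InspectionKm>
    </ns:VehicleInspection>
  </ns:Statistic>
</ns:ESStatistikListeModtag_I>'

doc <- read_xml(sample_xml)
ns <- xml_ns(doc)

# The predicate [.//ns:InspectionKm] keeps only Statistic nodes
# that contain an InspectionKm element somewhere below them.
with_km <- xml_find_all(doc, ".//ns:Statistic[.//ns:InspectionKm]", ns)

# Pull the requested fields out of each selected node.
result <- data.frame(
  regno = xml_text(xml_find_first(with_km, ".//ns:RegNo", ns)),
  make  = xml_text(xml_find_first(with_km, ".//ns:VehicleMake", ns)),
  km    = as.numeric(xml_text(xml_find_first(with_km, ".//ns:InspectionKm", ns))),
  stringsAsFactors = FALSE
)

print(result)
```

With the predicate in place there is no need to drop NA rows afterwards, because every selected node is guaranteed to have an InspectionKm child.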
Dave2e