I need to parse an XML file. I need to extract the time codes (beginning and ending) and the sentence associated with each time span.

The XML file is something like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="jj" audio_filename="01" version="1" version_date="150211">
 <Episode>
  <Section type="report" startTime="0" endTime="50.28281021118164">
   <Turn startTime="0" endTime="50.28281021118164">
    <Sync time="0"/>

    <Sync time="1.195"/>
    Something
    <Sync time="2.654"/>
    Something 2
    <Sync time="4.356"/>
    Something 3
    <Sync time="9.321"/>
    Something 4
    <Sync time="22.171"/>
    Something 5
    <Sync time="28.351"/>
    Something 6
    <Sync time="35.708"/>
    Something 7
    <Sync time="43.04"/>
    Something 8
   </Turn>
  </Section>
 </Episode>
</Trans>

And I need to obtain this final result:

0  1.195
1.195 2.654 Something
2.654 4.356 Something 2
4.356 9.321 Something 3
9.321 22.171 Something 4
22.171 28.351 Something 5
28.351 35.708 Something 6
35.708 43.04 Something 7
43.04 "endTime" Something 8

I'm working on Ubuntu. Any suggestions? Is it possible to do this with bash?

Thank you!

Sergi
  • @Serv the issue with that duplicate is that it seems to suggest writing a parser yourself, which personally I would regard as a terrible idea. – Tom Fenech Apr 30 '15 at 10:52
  • I second Tom's opinion, the linked-as-duplicate answer is highly voted for its novelty, not for its quality. It's actually a pretty bad answer. – Tomalak Apr 30 '15 at 11:05

1 Answer

XSLT to the rescue! Use this stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="@*|node()">
    <xsl:apply-templates select="@*|node()"/>
  </xsl:template>

  <xsl:template match="Sync">
    <xsl:value-of select="@time"/>
    <xsl:text> </xsl:text>
    <xsl:choose>
      <xsl:when test="following-sibling::Sync">
        <xsl:value-of select="following-sibling::Sync[1]/@time"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text>"endTime"</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text> </xsl:text>
    <xsl:value-of select="normalize-space(following-sibling::text()[1])"/>
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>
</xsl:stylesheet>

Then use the XSLT processor of your choice to apply it to the XML file. For example, with xsltproc:

xsltproc file.xsl file.xml

where file.xsl contains the above stylesheet and file.xml is your XML file.
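
If no XSLT processor is at hand, the same pairing logic can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration, not part of the stylesheet above: the helper name sync_rows and the embedded SAMPLE string are made up for the example, and it assumes that the last line should end at the enclosing Turn's endTime attribute (the stylesheet prints the literal "endTime" instead). It also assumes that only text sits between consecutive Sync elements.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (hypothetical sample data).
SAMPLE = """<Trans scribe="jj" audio_filename="01" version="1" version_date="150211">
 <Episode>
  <Section type="report" startTime="0" endTime="50.28281021118164">
   <Turn startTime="0" endTime="50.28281021118164">
    <Sync time="0"/>

    <Sync time="1.195"/>
    Something
    <Sync time="2.654"/>
    Something 2
   </Turn>
  </Section>
 </Episode>
</Trans>"""


def sync_rows(root):
    """Yield (start, end, text) for every Sync element inside each Turn."""
    for turn in root.iter("Turn"):
        syncs = turn.findall("Sync")
        for i, sync in enumerate(syncs):
            start = sync.get("time")
            # The end time is the next Sync's time. For the last Sync, fall
            # back to the Turn's endTime attribute (an assumption about the
            # desired output, see the note above).
            if i + 1 < len(syncs):
                end = syncs[i + 1].get("time")
            else:
                end = turn.get("endTime")
            # ElementTree keeps the text that follows an element in .tail;
            # split/join collapses whitespace like normalize-space().
            text = " ".join((sync.tail or "").split())
            yield start, end, text


rows = list(sync_rows(ET.fromstring(SAMPLE)))
for start, end, text in rows:
    print(start, end, text)
```

For a real file, replacing ET.fromstring(SAMPLE) with ET.parse("file.xml").getroot() should work the same way, where file.xml is your XML file.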

Wintermute
  • `following-sibling::Sync[1]/@time`, and you don't need the identity template. – Tomalak Apr 30 '15 at 11:03
  • @Tomalak Both appear to work. Let me check the spec. – Wintermute Apr 30 '15 at 11:06
  • Yes, but it's unclean. `value-of` turns the given node-set into a string, a conversion that is done only for the first node in the set. So `following-sibling::Sync/@time` selects more than one node, but `<xsl:value-of>` prints only the first one. I consider this "works by accident" and therefore a lingering bug. – Tomalak Apr 30 '15 at 11:09
  • I suppose the mechanism is a bit arcane. I'll do the same for the `following-sibling::text()` at the bottom, then. Thanks! – Wintermute Apr 30 '15 at 11:12
  • It's the same mechanism that kicks in when you say `substring(//multiple-nodes, 1, 5)` - this returns the first 5 characters of the first node of the selected set. XSLT 1.0 does that silently, XSLT 2.0+ will explicitly complain about trying to use a sequence of more than one item when a single node was expected. – Tomalak Apr 30 '15 at 11:16
  • I actually like that; complaining when the programmer does silly things is a good thing in compilers/interpreters/processors. I'll look deeper into XSLT 2 at some point; my experience so far has been that there are few XSLT processors that understand it. Although I have to admit that I don't exactly use XSLT every day. – Wintermute Apr 30 '15 at 11:32
  • Few *client side* processors, that's true. But XSLT has never really been a client side language, despite the fact that it originally aspired to be one. XSLT on the server has plenty of choices for 2.0. – Tomalak Apr 30 '15 at 11:34