Please excuse me for the title!
I have an XML file, but I am not parsing it. I am using R: I read the file with readLines() and apply functions such as gsub() to modify it. I am applying a substitution where "p" becomes ".p", but I do not want it applied between "table" and "/table".

Input

<?xml version="1.0"?>
<h2>
  <h4>
    <hdtitle>Circuit Description</hdtitle>
    <p>The commanded throttle position (TP)the values</p>
  </h4>
  <h4>
    <hdtitle>DTC Descriptor</hdtitle>
    <p>This diagnostic procedure supports the following DTC:</p>
    <p>DTCP2101 Throttle Actuator Position Performance</p>
  </h4>
  <h4>
    <hdtitle>Test Description</hdtitle>
    <p>The numbers below refer to the step numbers on the diagnostic </p>
      <exp-item id="td08">
        <exp-itemnum>8</exp-itemnum>
        <p>The throttle valve is spring pressure</p>
      </exp-item>
      <exp-item id="td11">
        <exp-itemnum>11</exp-itemnum>
        <p>When the ignition is</p>
      </exp-item>
    <table frame="all" pgwide="page-wide" titlesource="cell-title">
      <tgroup align="left" char="" charoff="50" cols="4" colsep="1" rowsep="1">
        <colspec charoff="50" colname="col1" colwidth="0.51in"/>
        <colspec charoff="50" colname="col4" colwidth="1.40in"/>
        <p>The numbers below refer to the step numbers </p>
      </tgroup>
    </table>
  </h4>
</h2>

Output

<?xml version="1.0"?>
<h2>
  <h4>
    <hdtitle>Circuit Description</hdtitle>
    <p>The commanded throttle position (TP)the values.</p>
  </h4>
  <h4>
    <hdtitle>DTC Descriptor</hdtitle>
    <p>This diagnostic procedure supports the following DTC:</p>
    <p>DTCP2101 Throttle Actuator Position Performance.</p>
  </h4>
  <h4>
    <hdtitle>Test Description</hdtitle>
    <p>The numbers below refer to the step numbers on the diagnostic .</p>
      <exp-item id="td08">
        <exp-itemnum>8</exp-itemnum>
        <p>The throttle valve is spring pressure.</p>
      </exp-item>
      <exp-item id="td11">
        <exp-itemnum>11</exp-itemnum>
        <p>When the ignition is</p>
      </exp-item>
    <table frame="all" pgwide="page-wide" titlesource="cell-title">
      <tgroup align="left" char="" charoff="50" cols="4" colsep="1" rowsep="1">
        <colspec charoff="50" colname="col1" colwidth="0.51in"/>
        <colspec charoff="50" colname="col4" colwidth="1.40in"/>
        <p>The numbers below refer to the step numbers </p>
      </tgroup>
    </table>
  </h4>
</h2>

As you can see, ".p" is not applied between "table" and "/table". Please help!

user227710
Karan Pappala
  • I don't see any changes in the output. Where is the code you are running for the `gsub()`. This is still not a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) – MrFlick Jul 08 '15 at 19:42
  • `The commanded throttle position (TP)the values` becomes `The commanded throttle position (TP)the values.`: one with `.` – user227710 Jul 08 '15 at 20:08
  • Use a tool that can handle multiline xml. (NOT regex.) See: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454?s=1|1.8688#1732454 – IRTFM Jul 08 '15 at 22:52

1 Answer

I think when you say "replace 'p' by '.p'", you mean "add a full stop at the end of every paragraph"?

You could do it gsub-y like this (BUT handling XML with regex is usually a bad idea: this assumes well-formed XML input, no weird whitespace in the tags, no embedded tags within text/comments, no tables within tables, etc):

xml <- readLines('clipboard') # read in your data from clipboard..
# find starts and ends of tables
tbl.start <- grep('<table\\b', xml)
tbl.end <- grep('</table>', xml)
# find location of the </p>
p.end <- grep('</p>', xml)
# filter these to exclude ones that are inside a <table>
i <- vapply(p.end, function (p) all(p <= tbl.start | p >= tbl.end), logical(1))
p.end <- p.end[i]
# replace
xml[p.end] <- gsub('</p>', '.</p>', xml[p.end])

Note your "desired output" didn't put a full stop after the "This diagnostic procedure supports the following DTC:" or "When the ignition is" despite these being p elements not inside a table element, so you either made a mistake in your desired output or you haven't thought of your criteria properly.
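If the intent is to add the full stop only where a letter or digit comes directly before the closing tag, the line-based approach could be narrowed as in the sketch below. That criterion is an assumption, since the desired output is not consistent on this point, and the inline sample is hypothetical; it stands in for the lines read from the file so the snippet does not depend on the clipboard.

```r
# Hypothetical sample standing in for the output of readLines()
xml <- c(
  '<h4>',
  '  <p>The throttle valve is spring pressure</p>',
  '  <p>This diagnostic procedure supports the following DTC:</p>',
  '  <table frame="all">',
  '    <p>The numbers below refer to the step numbers</p>',
  '  </table>',
  '</h4>'
)

# line numbers where tables start and end (assumes no nested tables)
tbl.start <- grep('<table\\b', xml)
tbl.end   <- grep('</table>', xml, fixed = TRUE)

# assumed criterion: a letter or digit immediately before </p>
p.end <- grep('[[:alnum:]]</p>', xml)

# keep only the lines that are not between a <table> and its </table>
outside <- vapply(p.end,
                  function(p) all(p <= tbl.start | p >= tbl.end),
                  logical(1))
p.end <- p.end[outside]

# insert the full stop, keeping the character that preceded </p>
xml[p.end] <- gsub('([[:alnum:]])</p>', '\\1.</p>', xml[p.end])
```

Only the first paragraph changes here: the second ends in a colon and the third sits inside the table. The same caveats about handling XML line by line with regex still apply.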

However

When parsing XML you may as well use an XML package.

library(XML)
# read in the XML (I'm doing it from clipboard, you do it however suits you).
# You need useInternalNodes=TRUE though.
doc <- xmlTreeParse(readLines('clipboard', warn=FALSE), useInternalNodes=TRUE)
# get `p` elements that are not inside a `table` element, and
#   add a '.' to the end of the text.
xpathApply(doc, "//p[not(ancestor::table)]",
           function (p) xmlValue(p) <- paste0(xmlValue(p), '.')
           )
# type 'doc' to inspect the output
# could write it out now, for example
saveXML(doc, file='output.xml')

This way you know your regex won't get confused by anything tricksy like tables embedded in tables, or things-that-look-like-tags-but-aren't in the comments/CDATA and all the other problems you get when parsing XML with regex. AND the code is shorter/easier to understand.

mathematical.coffee
  • Yes, you are right. My condition was supposed to be to put a "." before /p only when there is a character behind it. Ex: at becomes at.. I am trying to figure this out. Thank you for your response. – Karan Pappala Jul 09 '15 at 05:54
  • So then in the function inside `xpathApply`, only update `xmlValue(p)` if it matches the appropriate regex. `?grep`. – mathematical.coffee Jul 09 '15 at 05:58
  • I have modified the first code as follows: p.end <- grep('[a-z]', xml) and it worked! – Karan Pappala Jul 09 '15 at 07:15
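The suggestion above to update `xmlValue(p)` only when the text matches a regex could be sketched roughly as follows with the XML package. The inline sample and the "ends in a letter or digit" test are assumptions made for illustration, not something the thread settles on.

```r
library(XML)

# Hypothetical inline sample; in practice this would come from the file
txt <- '<h4>
  <p>The throttle valve is spring pressure</p>
  <p>This diagnostic procedure supports the following DTC:</p>
  <table><tgroup><p>Inside a table</p></tgroup></table>
</h4>'

doc <- xmlParse(txt, asText = TRUE)

# p elements that have no table ancestor
nodes <- getNodeSet(doc, "//p[not(ancestor::table)]")

for (p in nodes) {
  value <- xmlValue(p)
  # assumed criterion: append '.' only if the text ends in a letter or digit
  if (grepl("[[:alnum:]]$", value)) {
    xmlValue(p) <- paste0(value, ".")
  }
}

# text of every p element, in document order, for inspection
result <- xpathSApply(doc, "//p", xmlValue)
```

The nodes are modified in place, so `doc` can be written out afterwards with `saveXML(doc, file='output.xml')` as in the answer.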