I've read the SO questions about why you should never use regex on {HT,X}ML, such as "Regex to Indent an XML File", but I thought I'd post a function I wrote that does absolutely nothing but indent XML lines according to their nesting level.

To meet the SO guidelines, I'll Jeopardy-ize my solution :-), so:

What will go wrong when I start using this function to format XML files that some unnamed bad person sent me sans any indents?

xmlit <- function(x, indchar = '\t') {
  # x must be a character vector, one element per line of the XML file.
  # Add an indent for every line below one starting with an opener
  # (matching "<[^/?!]"), and remove an indent at every line starting "</".
  # indchar is assumed to be a single character: a closer strips exactly
  # one character from the front of the line and of the running indent.

  indit <- ''
  y <- vector('character', length(x))
  for (j in seq_along(x)) {
    # first add whatever indent we're up to
    y[j] <- paste(indit, x[j], collapse = '', sep = '')
    # check for openers: '<' but not '</' or '/>'
    if (grepl('<[^/?!]', x[j]) && !grepl('/>', x[j]) && !grepl('</', x[j])) {
      indit <- paste(indit, indchar, collapse = '', sep = '')
    } else {
      # check for closers: '</'
      if (grepl('<[/]', x[j]) && !grepl('<[^/?!]', x[j])) {
        # move existing line back out one indent
        y[j] <- substr(y[j], 2, 1000)
        indit <- substr(indit, 2, 1000)
      }
    }
  }
  # Note that I'm depending on every level to have a matching closer,
  # and that in particular the very last line is a closer.
  return(invisible(y))
}
Carl Witthoft

1 Answer

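There is also an assumption that any opening tag must be the first thing on a line. If not, there are problems. In the example below, the second <begin> shares a line with <foo/>, so it is never counted as an opener; the final closer then has no indent left to remove and loses its first character instead: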

> cat(xmlit(c("<begin>","<foo/><begin>","</begin>","</begin>")), sep="\n")
<begin>
        <foo/><begin>
</begin>
/begin>

For some subset of XML with enough assumptions about (additional) structure, regular expressions can work. But if the assumptions get violated, well, that's why there are parsers.
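
To illustrate how one such assumption could be relaxed, here is a minimal sketch in base R that first splits the input into one token per line (tag or text) and only then indents by depth. The helper name indent_tokens is hypothetical and not from the question. It is still regex-based, so it is not a parser: it assumes no '>' appears inside attribute values, comments, or CDATA sections, and it discards the original line breaks and surrounding whitespace of text nodes.

```r
# Hypothetical helper (an illustration, not the question's function):
# tokenize first, then indent, so several tags on one line are handled.
indent_tokens <- function(x, indchar = "\t") {
  txt <- paste(x, collapse = "")
  # A token is either a complete tag "<...>" or a run of text between tags.
  # Assumption: '>' never occurs inside a tag except as its terminator.
  tokens <- regmatches(txt, gregexpr("<[^>]*>|[^<]+", txt))[[1]]
  tokens <- trimws(tokens)
  tokens <- tokens[nzchar(tokens)]

  out <- character(length(tokens))
  depth <- 0L
  for (i in seq_along(tokens)) {
    tok <- tokens[i]
    is_closer <- startsWith(tok, "</")
    # Openers: not "</", "<?" or "<!", and not self-closing "<.../>"
    is_opener <- grepl("^<[^/?!]", tok) && !endsWith(tok, "/>")
    # A closer sits at the same level as its opener; never go below zero
    if (is_closer) depth <- max(depth - 1L, 0L)
    out[i] <- paste0(strrep(indchar, depth), tok)
    if (is_opener) depth <- depth + 1L
  }
  out
}

x <- c("<begin>", "<foo/><begin>", "</begin>", "</begin>")
cat(indent_tokens(x, indchar = "  "), sep = "\n")
# <begin>
#   <foo/>
#   <begin>
#   </begin>
# </begin>
```

Tracking an integer depth rather than trimming characters off a running indent string means an unmatched closer cannot eat into the tag itself, though the output for malformed input is still only a best guess.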

Brian Diggs
  • Yeah, that's the impression I was getting. I'll keep my code local, for use w/ the particular flavor of xml being used on specific projects. – Carl Witthoft Jul 18 '13 at 11:51