
I'm mimicking a large xml file, which I want to import into MediaWiki. The file is done, but the content inside <text>content</text> still contains < and > characters that I must encode first.

I would like to do the encoding step with a regex (I'm using Windows, with software such as Sublime Text, EditPad or Vim). I should be able to run a PHP script as well.

Using ({{word)(.*?)(?=</text>) I was able to select all the targets for replacement (I don't want to encode the XML markup itself), but I don't know how to get the hard part done, i.e. how to replace every < and > inside the targeted text.

For clarity, here is a short extract showing what the content with the characters to encode looks like (I have 50,000 more like this in a 30 MB file):

  <page>
    <title>Title:75002</title>
    <ns>510</ns>
    <id>21</id>
    <revision>
      <id></id>
      <parentid></parentid>
      <timestamp>2015-1-5T14:49:09Z</timestamp>
      <contributor>
        <ip>0:0:0:0:0:0:0:1</ip>
      </contributor>
      <text xmlspace="preserve" bytes="345">{{word

| vedette             ={{{vedette}}}
| id            ={{ROOTPAGENAME}}

| vedette           =boutique, with forbidden > and 
 evil < multiline

<!-----------encyclo---------->

| étymologie        = still have sometimes a messing > 
and maybe a < more.

<!-----------relations-------->

| synonyme          ={{AutoLienSyno | }}

}}</text>
      <sha1></sha1>
      <model>wikitext</model>
      <format>text/x-wiki</format>
    </revision>
  </page>

Thank you.

Nemo
Gilles
  • *"mimicking a large xml file"* -- You *what*? Instead of trying to fix the mess after the fact, just build your XML file with a proper tool right from the start and everything falls into place automatically. – Tomalak Feb 22 '15 at 16:54
  • Do **not** try to manipulate XML or HTML with regexes. See [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](http://stackoverflow.com/q/701166/62576) for a long list of reasons why. Make life easier on yourself and build the XML properly with a tool that will handle the encoding for you automatically in the first place. – Ken White Mar 03 '15 at 14:10

1 Answer


For me, the easiest way to do several substitutions within a repeated selection of text was to use sed.

Write a command.txt file containing:

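 # work only on the lines from an opening <text ...> through the closing </text>
 /<text/,/<\/text>/{
   # leave the two boundary lines untouched
   /<text/b
   /<\/text>/b
   # encode & first, so the entities added below are not encoded again
   s/\&/\&amp;/g
   s/>/\&gt;/g
   s/</\&lt;/g
 }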

Then run sed -f command.txt input.xml > output.xml

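This way every <, > and & is encoded, but only in the targeted portions of text delimited by <text and </text>. The two boundary lines themselves are skipped entirely, so the XML tags on them remain unaltered; anything else sitting on those two lines is left unencoded as well.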

Documentation: http://sed.sourceforge.net/sedfaq4.html#s4.24
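If sed is not at hand, the same idea can probably be scripted. Below is a minimal sketch in Python (my choice here, the question only mentions PHP), assuming the content never contains a literal `</text>` and that no `<text .../>` element is self-closing. The names `TEXT_ELEMENT` and `encode_text_elements` are made up for the illustration. Unlike the sed script, it also encodes content sitting on the same line as the `<text>` or `</text>` tag.

```python
import re
from xml.sax.saxutils import escape

# Matches an opening <text ...> tag, its content, and the closing </text>.
# DOTALL lets the content span several lines; the lazy .*? stops at the
# first </text>, so this assumes the content never contains "</text>".
TEXT_ELEMENT = re.compile(r"(<text\b[^>]*>)(.*?)(</text>)", re.DOTALL)


def encode_text_elements(xml_source):
    """Escape &, < and > inside every <text> element, leaving the tags intact."""

    def _encode(match):
        opening, content, closing = match.groups()
        # escape() replaces & first, then < and >, so nothing is encoded twice
        return opening + escape(content) + closing

    return TEXT_ELEMENT.sub(_encode, xml_source)


if __name__ == "__main__":
    sample = '<text xmlspace="preserve">a > b & c < d</text>\n<sha1></sha1>'
    print(encode_text_elements(sample))
```

For a whole file, read input.xml into a string, pass it through `encode_text_elements`, and write the result to output.xml; a 30 MB file should fit in memory comfortably.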

Gilles