I'm building a large XML file that mimics a MediaWiki export, and I want to import it into MediaWiki.
The file is done, yet the content inside <text>content</text> still contains < and > characters that I must encode first.
I would like to do the encoding step with a regex (I'm on Windows, with editors such as Sublime Text, EditPad, or Vim). I should be able to run a PHP script as well.
Using ({{word)(.*?)(?=</text>)
I was able to select every target for replacement (I don't want to encode the XML markup itself), but I don't know how to do the hard part, i.e. how to replace every < and > inside the matched text.
For clarity, here is a short extract showing what the content with the characters I need to encode looks like (I have 50,000 more like it in a 30 MB file):
<page>
<title>Title:75002</title>
<ns>510</ns>
<id>21</id>
<revision>
<id></id>
<parentid></parentid>
<timestamp>2015-1-5T14:49:09Z</timestamp>
<contributor>
<ip>0:0:0:0:0:0:0:1</ip>
</contributor>
<text xmlspace="preserve" bytes="345">{{word
| vedette ={{{vedette}}}
| id ={{ROOTPAGENAME}}
| vedette =boutique, with forbidden > and
evil < multiline
<!-----------encyclo---------->
| étymologie = still have sometimes a messing >
and maybe a < more.
<!-----------relations-------->
| synonyme ={{AutoLienSyno | }}
}}</text>
<sha1></sha1>
<model>wikitext</model>
<format>text/x-wiki</format>
</revision>
</page>
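To make the goal concrete, here is a rough sketch of the kind of script I imagine would be needed, written in Python only for illustration (the same idea should carry over to PHP's preg_replace_callback). The function names are made up, and it rests on a few assumptions: the literal string </text> never occurs inside the wikitext, only < and > need encoding, and the whole file fits in memory.

```python
import re

# Matches one <text ...>...</text> element; DOTALL lets the body span lines.
# The (?<!/) lookbehind skips self-closing <text/> elements.
# Assumption: the literal string "</text>" never occurs inside the wikitext.
TEXT_ELEMENT = re.compile(r"(<text\b[^>]*(?<!/)>)(.*?)(</text>)", re.DOTALL)


def encode_angle_brackets(match):
    """Return the matched element with < and > in its body turned into entities."""
    opening, body, closing = match.groups()
    body = body.replace("<", "&lt;").replace(">", "&gt;")
    return opening + body + closing


def encode_text_elements(xml):
    """Encode < and > inside every <text> body, leaving the XML markup untouched."""
    return TEXT_ELEMENT.sub(encode_angle_brackets, xml)


if __name__ == "__main__":
    # Shortened version of the extract above, for a quick check.
    sample = (
        "<revision>\n"
        '<text xmlspace="preserve" bytes="345">{{word\n'
        "| vedette =boutique, with forbidden > and\n"
        "evil < multiline\n"
        "<!-----------encyclo---------->\n"
        "}}</text>\n"
        "<model>wikitext</model>\n"
        "</revision>\n"
    )
    print(encode_text_elements(sample))
```

The point of the callback is that the regex only locates each <text> element, while the actual < and > replacement happens on the captured body, so the surrounding tags are never touched.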
Thank you.