I am storing a large number of URLs (around 100,000) in an XML file (along with some other data). This worked fine with fewer URLs, but the XML file has now become very large (because of tags and indentation) and slow to parse. So I thought about grouping all the URLs inside a single XML element, and for that I need a delimiter. As an example, I would like to go from this:
<document>
  <bigGroupOfURLs>
    <OneURL>
      <nameOfData1>data1_1</nameOfData1>
      <nameOfData2>data1_2</nameOfData2>
      <URL>www.site1.com</URL>
    </OneURL>
    <OneURL>
      <nameOfData1>data2_1</nameOfData1>
      <nameOfData2>data2_2</nameOfData2>
      <URL>www.site2.com</URL>
    </OneURL>
  </bigGroupOfURLs>
  <someOtherData>...</someOtherData>
</document>
To something like this (but not using #):
<document>
  <bigGroupOfURLs>
    data1_1#data1_2#www.site1.com#data2_1#data2_2#www.site2.com
  </bigGroupOfURLs>
  <someOtherData>...</someOtherData>
</document>
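To make the idea concrete, here is a minimal sketch of the round trip I have in mind, assuming Python and its standard `xml.etree.ElementTree` module. The helper names `pack` and `unpack` are hypothetical, and `DELIM = "|"` is only a placeholder for whatever character turns out to be safe; it is not a claim that `|` never appears in a URL. The sketch refuses to pack a field that contains the delimiter, so a bad choice fails loudly instead of corrupting the data.

```python
import xml.etree.ElementTree as ET

DELIM = "|"  # placeholder only (an assumption); the real delimiter is the open question


def pack(records, delim=DELIM):
    """Flatten (data1, data2, url) tuples into one delimited string."""
    fields = [field for record in records for field in record]
    for field in fields:
        if delim in field:
            # Fail loudly rather than produce text that cannot be split back.
            raise ValueError(f"delimiter {delim!r} occurs in field {field!r}")
    return delim.join(fields)


def unpack(text, delim=DELIM, width=3):
    """Split the element text back into tuples of `width` fields."""
    # strip() drops whitespace added by indentation; this assumes no field
    # begins or ends with whitespace itself.
    text = (text or "").strip()
    if not text:
        return []
    fields = text.split(delim)
    if len(fields) % width != 0:
        raise ValueError("field count is not a multiple of the record width")
    return [tuple(fields[i:i + width]) for i in range(0, len(fields), width)]


records = [
    ("data1_1", "data1_2", "www.site1.com"),
    ("data2_1", "data2_2", "www.site2.com"),
]

# Build the compact document; ElementTree escapes XML-special characters
# such as < and & in the text when serializing.
root = ET.Element("document")
group = ET.SubElement(root, "bigGroupOfURLs")
group.text = pack(records)
xml_bytes = ET.tostring(root, encoding="utf-8")

# Parse it back and recover the original records.
restored = unpack(ET.fromstring(xml_bytes).find("bigGroupOfURLs").text)
```

Since the XML library handles escaping when it serializes the text, the delimiter only has to be absent from the data; it does not also have to be XML-safe.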
These URLs will come from tags inside HTML files, so they can come with all sorts of non-standard characters. For instance, the following are examples which may be included:
<a href="http://ja.wikipedia.org/wiki/メインページ">メインページ</a>
<a href="http://en.wikipedia.org/wiki/Stack Overflow">Stack Overflow</a>
There, we can see non-ASCII (UTF-8) characters and a space. These URLs are interpreted correctly, and I want to store them exactly as they appear there. So, which character is guaranteed never to appear in a URL? I would prefer it to be a printable character. Note that this will be inside an XML file, so I probably should not use the characters <, / or >.