
I am storing a large number of URLs (around 100,000) in an XML file, along with some other data. It worked fine with fewer URLs, but now the XML file has become very large (because of tags and indentation) and slow to parse. So I thought about grouping all the URLs inside a single XML element, and for that I need a delimiter. As an example, I would like to go from this:

<document>
  <bigGroupOfURLs>
    <OneURL>
      <nameOfData1>data1_1</nameOfData1>
      <nameOfData2>data1_2</nameOfData2>
      <URL>www.site1.com</URL>
    </OneURL>
    <OneURL>
      <nameOfData1>data2_1</nameOfData1>
      <nameOfData2>data2_2</nameOfData2>
      <URL>www.site2.com</URL>
    </OneURL>
  </bigGroupOfURLs>
  <someOtherData>...</someOtherData>
</document>

To something like this (but not using #):

<document>
  <bigGroupOfURLs>
    data1_1#data1_2#www.site1.com#data2_1#data2_2#www.site2.com
  </bigGroupOfURLs>
  <someOtherData>...</someOtherData>
</document>

These URLs will come from tags inside HTML files, so they may contain all sorts of non-standard characters. For instance, URLs like the following may be included:

<a href="http://ja.wikipedia.org/wiki/メインページ">メインページ</a>
<a href="http://en.wikipedia.org/wiki/Stack Overflow">Stack Overflow</a>

There, we can see UTF-8 characters and a space. These URLs are correctly interpreted, and I want to store them as they appear there. So, which character is guaranteed never to appear in a URL? I would prefer it to be a printable character. Note that this will be inside an XML file, so I probably should not use the characters < or >.

memo1288
  • You should use an XML parser, so that you can correctly handle entities. – SLaks Oct 08 '13 at 19:11
  • I already do. This is not about handling XML entities, but about separating URLs. – memo1288 Oct 08 '13 at 19:12
  • json array to the rescue? – dezman Oct 08 '13 at 19:13
  • Shouldn't be using XML anyways. XML is a conspiracy by storage vendors to boost sales by drastically bloating storage requirements of data. XML is handy for transmission of data, but it should never be a STORAGE medium. – Marc B Oct 08 '13 at 19:14
  • If it's just a list of URLs, why not store them some other way? Even a simple CSV text file would work better than this. – Geobits Oct 08 '13 at 19:16
  • I think this answer would help you: What characters make a URL invalid (http://stackoverflow.com/a/1547940/2145211), but I would suggest not doing what you're attempting at all. – Harrison Oct 08 '13 at 19:17
  • @Geobits Notice that there is also <someOtherData>; the list of URLs is only a part of the file. – memo1288 Oct 08 '13 at 19:18
  • I saw that, but condensing the URLs into one field sorta negates the point of using XML for them in the first place. XML is fine for representing things in a tag-element form, but if you're going to flatten it, it doesn't make sense. I said nothing about changing the way the rest of the data is stored. – Geobits Oct 08 '13 at 19:20
  • Agree that json (or yaml = json+) seems like a better choice here. Given that the URLs are the bulk of the file size, it's not clear that bunching them all together will actually save you much parsing time, and it will cause major headaches with extracting the data: you will lose the ability to process the XML incrementally, for instance; you'll have to read in all 100K URLs at once. json/yaml are lower overhead. Maybe you can find a faster parser... e.g. in Python, lxml > cElementTree > ElementTree... – Corley Brigman Oct 08 '13 at 19:34
  • What an extraordinarily presumptive collection of 'answers' (well, comments on the question). There's absolutely nothing wrong with the strategy that the poster is pursuing. – Michael Kay Oct 08 '13 at 22:21
  • @Ken I'm allowing URLs which are invalid by the standard, but which can be interpreted as valid URLs – memo1288 Oct 09 '13 at 01:39

2 Answers


There is more than one definition of "URL". Very often the term is used where "URI" or "IRI" is more correct. Many systems try to be permissive and allow things that are not technically legal according to the specs; Postel's law applies here, with its inevitable consequence that if some systems start being liberal about what they accept, everyone else has to follow suit.

A pretty safe delimiter to use is a single space, especially if you take care to ensure that any spaces within a URL are properly %-encoded as %20.

But before going for a micro-syntax like this, I would want to be quite convinced that XML parsing time really is the bottleneck.
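
If you do decide to go down this route, here is a minimal sketch of the space-delimited approach, assuming Python and its standard library; the element name follows the question's example, and the safe= character set is an assumption about which characters should survive unencoded:

import xml.etree.ElementTree as ET
from urllib.parse import quote

urls = [
    "http://ja.wikipedia.org/wiki/メインページ",
    "http://en.wikipedia.org/wiki/Stack Overflow",
]

# Percent-encode anything that is not legal in a URL, so a raw space
# inside a URL becomes %20 and can no longer collide with the delimiter.
# Keeping '%' in safe= is an assumption that the input is not already
# percent-encoded in a conflicting way.
encoded = [quote(u, safe=":/?#[]@!$&'()*+,;=%") for u in urls]

group = ET.Element("bigGroupOfURLs")
group.text = " ".join(encoded)  # a single space as the delimiter

# Reading the list back is a plain split on whitespace.
recovered = group.text.split()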

Michael Kay

Both of the URLs you mentioned are actually invalid:

http://ja.wikipedia.org/wiki/メインページ
http://en.wikipedia.org/wiki/Stack Overflow

If you type them into your browser, they will be percent-encoded before they're sent to the server. According to RFC 3986, the space character and the following printable ASCII characters are invalid in a URL:

" < > \ ^ ` { | }

Multi-byte UTF-8 sequences are invalid as well. That said, it's possible that some servers still accept these characters.

So I'd suggest that you normalize your URLs and separate them with whitespace.
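
Here is a small sketch of such a normalization step, assuming Python; the normalize_url helper is hypothetical, not a library function. It percent-encodes exactly the characters listed above, plus space, control characters, and non-ASCII bytes, and leaves everything else (including existing %XX escapes) untouched:

# Printable ASCII characters that RFC 3986 forbids in a URL.
INVALID = set('"<>\\^`{|}')

def normalize_url(url: str) -> str:
    out = []
    for ch in url:
        # Encode forbidden ASCII, whitespace/control characters, and
        # anything outside printable ASCII as UTF-8 percent-escapes.
        if ch in INVALID or ord(ch) <= 0x20 or ord(ch) >= 0x7F:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)

print(normalize_url("http://en.wikipedia.org/wiki/Stack Overflow"))
# http://en.wikipedia.org/wiki/Stack%20Overflow
print(normalize_url("http://ja.wikipedia.org/wiki/メインページ"))
# http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8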

nwellnhof