7

I know this is not a very straightforward question, but how would you design a SQL database in order to store data coming from a XML file?, (you have no XML schema, just a plain XML with a lots of different tagnames, attributes and nested elements).

This is more conceptual than technical. How we go from hierarchical data model (XML) to relational model (SQL).

ppeterka
  • 20,583
  • 6
  • 63
  • 78
Roger W.
  • 327
  • 1
  • 4
  • 15
  • 2
    If you have no schema, and want to use a traditional RDBMS, the traditional relational way, you're basically screwed. But you can go for the XML datatypes that are present in all major database systems. That allows you to handle XML content using XPath expressions, which is quite neat. – ppeterka Oct 30 '12 at 08:29
  • 1
    can you share the XML structure?? – Furqan Hameedi Oct 30 '12 at 08:29

3 Answers3

2

If you have no schema, and want to use a traditional RDBMS, the traditional relational way, you're basically screwed.

But you can go for the XML datatypes (in Oracle (>9i), in MS SQL (>2005), in Postgres, in DB2 ) that are present in all major database systems. That allows you to handle XML content using XPath expressions, which is quite neat.

Also recommended reading:

Or, you can skip having to convert your hierarchical model to relational, as this seems to be the prefect use case for a NoSQL DB like Cassandra or MongoDB.

(Posted as comment initially, but I think it might be worthy to be an answer...)

Community
  • 1
  • 1
ppeterka
  • 20,583
  • 6
  • 63
  • 78
1

Well, what's the problem? Representing trees as relations is simple.

NODE ( id, tag-name, text )
ATTR ( id, attr-name, attr-value )
NODEATRR ( node-id, attr-id )
NODENODE ( node-id, child-node-id )

Keys and connections between relations are obvious, I hope. It's ugly and stringly-typed, sure, but that's what you get if you want to store an arbitrary XML.

Joker_vD
  • 3,715
  • 1
  • 28
  • 42
  • Well, what if you have 10000 XML files daily, with 3000 tags each? (FYI, I had to deal with that...) What kind of DB will be able to handle queries with this structure? I know the answer: none... This is not only ugly, but completely unusable. To get a tag in the 10th depth, you have to do 10 JOINS? That is a plain no! (However, as it technically works, I don't vote it down) – ppeterka Oct 30 '12 at 08:38
  • 2
    10000 XML files, 3000 tags each? I'd parse data from them into something reasonable, and store. Or, if I don't need to process it, I'll just store it in VARCHAR. XML is format for *exchanging* data, it isn't supposed to use for storing data! Once you got XML, rip all you need from it and save in a reasonable matter. – Joker_vD Oct 30 '12 at 08:45
  • `I'd parse data from them into something reasonable` You can't do that, remember, the OP doesn't have XML schema data! `XML is format for exchanging data, it isn't supposed to use for storing data` Tell that to Oracle, Microsoft, or even the people behind Postgres. What if you have to store your communications for audit purposes, and at the same time you have to be able to use it as a data warehouse for reporting purpose? Would you keep all the XML files, AND at the same time have them stored in a relational DB schema too? What if OP needs literally every bit of it? – ppeterka Oct 30 '12 at 08:54
  • a) Well, what else can I do except from storing as-is or performing basical parsing? Nothing. b) It depends on if reparsing from stored XML is cheaper than storing two copies of data: XML and parsed. As for Oracel and Microsoft... back in 60es IBM believed hierarchical DBs were all future -- where are those now? Oh wait, right, they are reborn again as XML. – Joker_vD Oct 30 '12 at 08:54
  • `Also, what's so scary about 10 JOINs except awkward syntax?` Well, awkward syntax is something I don't care about - that's the job of the database abstraction layer of choice in modern environments. But performance tends to be very bad for this kind of hierarchical situation. 10 JOINs in itself are nothing. But in this structure, if the XML is properly hierarchical, and let's say only 5 deep, to find out basic information, like who ordered what in a typical web store, you'd have to have 3-4-5 JOINs for EACH attribute! That quickly grows slow. – ppeterka Oct 30 '12 at 09:00
  • 1
    But if you have a "properly hierarchical" XML, with DTD/XML Schema, then you can have more problem-oriented DB schema -- with tables like CUSTOMERS, ORDERS, PRODUCTS etc. I think I should clarify, that the proposed design almost never would be used because there would usually be better approaches. But sometimes you need to store arbitrary XML and match for specific patterns in text/attributes -- though even then a Python script on raw XML files probably would work out better. – Joker_vD Oct 30 '12 at 09:22
  • I agree, with proper schema, there is no need for all this crap. But to my greatest sadness, there are plenty of places (even significant money-making businesses), where having a schema is not the part of day-to-day operations... Sometimes, it is even a requirement to deal with malformed XML (unescaped & for example!). As for Python and raw XML - I don't have benchmark data for that, but my experience with Perl seems to back that in some cases. Problem is, this is not always feasible solution... Note that we're on the same side: we both want a better place to live (code, work) in... – ppeterka Oct 30 '12 at 10:06
0

There are some ingenious tree encoding schemes in SQL. Again, even clever tree encoding is inferior to properly designed database schema.

Community
  • 1
  • 1
Tegiri Nenashi
  • 3,066
  • 2
  • 19
  • 20