
I'm working on an application that has the following use case:

  • Users upload csv files, which need to be persisted across application restarts
  • The data in the csv files needs to be queried, sorted, etc.
  • Users specify the query-able columns in a csv file at the time of uploading the file

The currently proposed solution is:

  • For small files (much more common), transform the data into xml and store it either as a LOB or in the file system. For querying, slurp the whole data into memory and use something like XQuery
  • For larger files, create dynamic tables in the database (MySQL), with indexes on the query-able columns

Although we have prototyped this solution and it works reasonably well, it's keeping us from supporting more complex file formats such as XML and JSON. There are also a few more niggling issues with the solution that I won't go into.

Considering the schemaless nature of NoSQL databases, I thought they might be used to solve this problem. I have no practical experience with NoSQL, though. My questions are:

  1. Is NoSQL well suited for this use case?
  2. If so, which NoSQL database?
  3. How would we store csv files in the DB (as collections of key-value pairs where the column headers make up the keys and the data fields from each row make up the values)?
  4. How would we store XML/JSON files with possibly deeply hierarchical structures?
  5. How about querying/indexing and other performance considerations? How does that compare to something like MySQL?

Appreciate the responses and thanks in advance!

example csv file:

employee_id,name,address  
1234,XXXX,abcabc  
001001,YYY,xyzxyz  
...  

DDL statement:

CREATE TABLE `employees`(  
  `id` INT(6) NOT NULL AUTO_INCREMENT,  
  `employee_id` VARCHAR(12) NOT NULL,  
  `name` VARCHAR(255),  
  `address` TEXT,  
  PRIMARY KEY (`id`),  
  UNIQUE INDEX `EMPLOYEE_ID` (`employee_id`)  
);  

for each row in csv file

INSERT INTO `employees`  
                (`employee_id`,  
                 `name`,  
                 `address`)  
       VALUES (...);  

1 Answer


Not really a full answer, but I think I can help on some points.

For number 2, I can at least give this link that helps sort out the different NoSQL implementations.

For number 3, using a SQL database (though the same idea should also fit a NoSQL system), I would represent the columns and the rows in tables of their own, and add a third table with foreign keys to the column and the row, holding the value of that cell. You end up with one big cell table that is easy to filter.
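
A minimal sketch of what that could look like in MySQL (table and column names are mine, not from the question; the `file_id` column just ties rows and columns back to an uploaded file, as the comments below discuss):

CREATE TABLE `csv_column` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `file_id` INT NOT NULL,                      -- which uploaded csv file this column belongs to
  `header` VARCHAR(255) NOT NULL,              -- column header from the csv
  `queryable` BOOLEAN NOT NULL DEFAULT FALSE,  -- marked by the user at upload time
  PRIMARY KEY (`id`)
);

CREATE TABLE `csv_row` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `file_id` INT NOT NULL,                      -- which uploaded csv file this row belongs to
  PRIMARY KEY (`id`)
);

CREATE TABLE `csv_cell` (
  `row_id` INT NOT NULL,
  `column_id` INT NOT NULL,
  `value` TEXT,                                -- the raw field value
  PRIMARY KEY (`row_id`, `column_id`),
  FOREIGN KEY (`row_id`) REFERENCES `csv_row`(`id`),
  FOREIGN KEY (`column_id`) REFERENCES `csv_column`(`id`)
);

Filtering on a query-able column then becomes a join from `csv_cell` through `csv_column`, regardless of what the csv file looked like.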

For number 4, you need to "represent hierarchical data in a table"

The common approach to this is to have a table with the attributes and a foreign key to the same table pointing to the parent, for example:

+----+------------+------------+--------+
| id | attribute1 | attribute2 | parent |
+----+------------+------------+--------+
|  0 | potato     | berliner   | NULL   |
|  1 | hello      | jack       | 0      |
|  2 | hello      | frank      | 0      |
|  3 | die        | please     | 1      |
|  4 | no         | thanks     | 1      |
|  5 | okay       | man        | 4      |
|  6 | no         | ideas      | 2      |
|  7 | last       | one        | 2      |
+----+------------+------------+--------+
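
In SQL terms, that parent-pointer (adjacency list) table could be declared roughly like this; the table name `nodes` is just for illustration:

CREATE TABLE `nodes` (
  `id` INT NOT NULL,
  `attribute1` VARCHAR(255),
  `attribute2` VARCHAR(255),
  `parent` INT NULL,                           -- NULL for the root
  PRIMARY KEY (`id`),
  FOREIGN KEY (`parent`) REFERENCES `nodes`(`id`)
);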

Now the problem is that if you want to get, say, all the child elements of element 1, you have to query each item individually to obtain its children. Some other operations are hard too, because they need a path to the object, traversing many other objects and making extra data queries.
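
For example, with the hypothetical `nodes` table above, fetching the subtree under element 1 means issuing a query per level (or per node) and feeding the results back in:

-- direct children of element 1
SELECT `id` FROM `nodes` WHERE `parent` = 1;
-- in the sample data this returns 3 and 4, so query again...
SELECT `id` FROM `nodes` WHERE `parent` IN (3, 4);
-- ...and keep going until no more rows come back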

One common workaround to this, and the one I use and prefer, is called modified pre-order tree traversal.

Using this technique, we need an extra layer between the data storage and the application, to fill some extra columns at each structure-altering modification. We will assign three properties to each object: left, right and depth.

The left and right properties are filled by counting each object from the top, traversing the whole tree recursively.

This is a vague approximation of the traversal algorithm for left and right (the depth part can easily be guessed; it only takes a few extra lines):

  1. Set the left attribute of the tree root (or of the first tree root, if there are several) to 1
  2. Go to its first (or next) child. Set its left attribute to the last number plus one (here, 2)
  3. Does it have any children? If yes, go back to step 2. If no, set its right attribute to the last number plus one.
  4. Go to the next child and do the same as in step 2
  5. If there are no more children, go to the next child of the parent and do the same as in step 2

Here is a picture explaining the result we get:

[figure: MPTT left/right numbering of the example tree (source: narod.ru)]
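
In case the image does not load, this is roughly the numbering the algorithm produces for the sample table above (rows listed in traversal order; the left/right/depth values were worked out by hand, so treat them as illustrative):

+----+------+-------+-------+
| id | left | right | depth |
+----+------+-------+-------+
|  0 |    1 |    16 |     0 |
|  1 |    2 |     9 |     1 |
|  3 |    3 |     4 |     2 |
|  4 |    5 |     8 |     2 |
|  5 |    6 |     7 |     3 |
|  2 |   10 |    15 |     1 |
|  6 |   11 |    12 |     2 |
|  7 |   13 |    14 |     2 |
+----+------+-------+-------+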

Now it is much easier to find all the descendants of an object, or all of its ancestors. Either one can be done with a single query, using left and right.
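
For instance, sticking with the hypothetical `nodes` table and calling the columns `lft` and `rgt` (LEFT and RIGHT are reserved words in SQL), the two classic queries look like this:

-- all descendants of element 1 (lft = 2, rgt = 9 in the table above)
SELECT * FROM `nodes` WHERE `lft` > 2 AND `rgt` < 9;

-- all ancestors of element 5 (lft = 6, rgt = 7)
SELECT * FROM `nodes` WHERE `lft` < 6 AND `rgt` > 7;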

What is important when using this is to have a good implementation of the layer between the data and the application, handling the left, right and depth attributes. These fields have to be adjusted when:

  • An object is deleted
  • An object is added
  • The parent field of an object is modified

This can be done with a parallel process, using locks. It can also be implemented directly between the data and the application.
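
To illustrate what that layer does, here is the classic pair of updates for inserting a new leaf under a parent whose `rgt` is known, again against the hypothetical `nodes` table (assuming `lft`, `rgt` and `depth` columns have been added to it, and that everything runs in a single transaction):

-- parent is element 4 (lft = 5, rgt = 8 in the table above)
UPDATE `nodes` SET `rgt` = `rgt` + 2 WHERE `rgt` >= 8;
UPDATE `nodes` SET `lft` = `lft` + 2 WHERE `lft` > 8;
-- the new leaf slots in where the parent's old rgt was
INSERT INTO `nodes` (`id`, `attribute1`, `attribute2`, `parent`, `lft`, `rgt`, `depth`)
VALUES (8, 'new', 'leaf', 4, 8, 9, 3);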

See these links for more information about trees:

I personally had great results with django-nonrel and django-mptt the few times I did NoSQL.

  • Thanks for the helpful answer. I'm leaning towards the Closure table concept given [here](http://karwin.blogspot.in/2010/03/rendering-trees-with-closure-tables.html). I was hoping for a simpler NoSQL based solution, but I guess I'll stick with SQL for now. I'm not sure I understand your point about my 3rd question. Could you explain what advantage this method has over what I'm currently doing? – Vivek Nov 09 '12 at 14:12
  • What are you currently doing? "collection of key-value pairs where the column headers make up the keys and the data fields from each row make up the values"? Storing so many elements in data fields is not a good use of a database. The database must fit the data, not the data fit the database. – pistache Nov 10 '12 at 11:02
  • What I'm currently doing is "For larger files, create dynamic tables in the database (MySQL), with indexes on the query-able columns". Is that the best way to go? – Vivek Nov 13 '12 at 10:01
  • I don't exactly understand what you mean by "create dynamic tables". I would not treat small and large files differently; I'd put all of that data in the database, using 4 tables: **CSVFile**; **Column** (with a foreign key to CSVFile, and a Boolean field "queryable"); **Row** (with a foreign key to CSVFile); **Cell** (with foreign keys to Row and Column). This makes some big tables, but you get a consistent data schema, and with properly optimized SQL queries and procedures you shouldn't have any problem. – pistache Nov 13 '12 at 10:32
  • By "create dynamic table" I mean that every time a csv file is uploaded, I issue a DDL statement from my application, which creates a table to hold the data in the csv file. The table will have all the columns that the csv file does, and indexes on the query-able columns. I realize this will not give me a consistent schema the way your suggestion would, but in terms of performance, which would be the better way to go? – Vivek Nov 14 '12 at 11:20
  • How will you store the cell contents? – pistache Nov 14 '12 at 16:17
  • I have attached an example to the question... – Vivek Nov 16 '12 at 04:38
  • Yes, your schema is better than mine in this situation. I'm sorry, my head was busy with the database representation of sheets where the number of columns is **not fixed**; in that case the problem is not the same. – pistache Nov 16 '12 at 08:53