What would be the optimal way to store large amounts of Unicode text?

Question

I am developing a project where I need to store around 15k Unicode characters. What would be the best way to store this?

The main application is in C# and some other data is stored in a SQL Server DB. This huge amount of text needs to in someway be identifiable by a randomly generated entry key and a category key. Obviously, there may/should be more than one entry that has the same category key.

These entries will be added, retrieved, and also searched using keywords by category key.

I am currently looking at the following 2 ways: (Other ideas more than welcome)

Files

Each category key represented as a folder and each entry as a file using the entry key as the file name.

To search I would just use the Apache Lucene.Net project to build an index and just search by it.

SQL Server

Just stored as another column of type NVARCHAR(MAX) in a table.

Which of these ways is best? I am looking for other options, and pros/cons about these.

score 3 · Accepted Answer · answered Oct 01 '15 at 04:08

3

To answer your question, you have to answer this questions:

Will you store data more than 2 GB? Max data in nvarchar(max) is 2 GB.
Will you manipulate with this data inside sql server (full-text search, grouping, and so on)? You can't join or group by data from files.
Do you need transactional opertations? You can add file and failed to add record to DB and vice version.

So, assuming you have answers on this questions you can decide. My advice - store large data in files or other blob storage (azure blob, amazone and so on) and has a table with list of this files.

Pros:

Small database size - easy to backup, easy to restore
Fast queries to file-list table (counts, joining, grouping and so on)

Cons:

You need to keep in sync your database and files storage
You have non-transactional operation, but it can be ignored by order of operations: save (or delete) file and then make changes in DB. So if you failed on DB, just start operation from the first step.

answered Oct 01 '15 at 04:08

Backs

24,430
5
58
85

1. No. Like I said in my original question, much less than that. 2. No. This data does not need to be manipulated in any way. It just needs to be there and be searchable. 3. No. I don't need transactional operations. If any type of failure occurs while inserting, the file simply will not be created. – Maxwell175 Oct 01 '15 at 04:45
So, at the moment the Pros of Files overweight the Pros of DB. – Maxwell175 Oct 01 '15 at 05:01
@MDTech.us_MAN be carefull, my answer is really depends on my view of situation, I may not know all your requirements. So, think twice also – Backs Oct 01 '15 at 05:41
I will be sure to do so. – Maxwell175 Oct 01 '15 at 19:47

Richard Schneider · Answer 2 · 2015-10-01T04:32:39.807

0

it's much easier having all data in one datastore. I would go with the SQL server solution.

However, if you are primarily concerned with storage space and the text is mainly ASCII, then encoding as UTF-8 would save ~50%. SQL server does not support UTF-8, only UTF-16 (UCS-2). So saving a separate file could have benefits.

edited Oct 01 '15 at 04:32

answered Oct 01 '15 at 04:01

Richard Schneider

34,944
9
57
73

Everything needs to be saved as Unicode. I would like to make sure that this part would not be a limiting factor in terms of supported languages/characters. – Maxwell175 Oct 01 '15 at 04:48
UTF-8 is a loss less encoding. All languages/characters can be used. It is biased towards ASCII for efficiency. In UTF-8, an English character (U+0000 - U+007F) uses only 1 byte, whereas a Chinese character typically requires 3 bytes. In UTF-16 both English and Chinese typically require 2 bytes. – Richard Schneider Oct 01 '15 at 05:07
Ahh, I see now. I have misunderstood the concept of UTF-8. This cleared it up for me: http://stackoverflow.com/a/643706/1610754. To sum it up, I am leaning toward files at the moment. – Maxwell175 Oct 01 '15 at 19:45

What would be the optimal way to store large amounts of Unicode text?

Files

SQL Server

2 Answers2