I'm trying to choose the best NoSQL database service for my data.
Most NoSQL databases out there only support UTF-8, and unlike relational databases there's no way to enforce a different encoding. The problem is that UTF-8 uses a single byte only for the first 128 characters (code points 0–127), but two bytes for the next 128 (128–255), and those are the characters that make up 80% of my data (don't ask me why I have more of these squiggles than the actual English alphabet, it's a long-winded answer).
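To make the doubling concrete, here's a quick check in Python (the sample string is just an illustration of the kind of data I mean):

```python
# Byte counts for a string dominated by Latin-1 "upper half"
# characters (code points 128-255) under different encodings.
s = "àñÿÝtçh"
print(len(s))                       # 7 characters
print(len(s.encode("utf-8")))       # 12 bytes: 5 two-byte chars + 2 one-byte chars
print(len(s.encode("iso-8859-1")))  # 7 bytes: exactly one byte per character
```

So for data like mine, UTF-8 storage is close to double the ISO-8859-1 size.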
I'll have to run lots of regular-expression queries on those strings, which look like "àñÿÝtçh" and consist mostly of characters 128–255 in ISO-8859-1, ISO-8859-15 or Windows-1252. I know I will never need to store characters outside that range, so it's safe to work with only 256 characters and miss out on the gazillion characters UTF-8 supports. I'm also aware that ISO-8859-1 will create plenty of compatibility problems with JSON objects and the like. But we'll be running lots of regex queries, some quite complex, and doubling the byte count just because I have no choice but to use UTF-8 may have a negative impact on performance.
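One workaround I've been sketching (purely hypothetical, client-side): store the strings as raw ISO-8859-1 bytes in a binary field, so each character is exactly one byte, and run byte-level regexes over them. In Python that looks like:

```python
import re

# Hypothetical workaround: encode to ISO-8859-1 before storing so
# every character occupies exactly one byte, then regex over bytes.
raw = "àñÿÝtçh".encode("iso-8859-1")   # 7 bytes, one per character

# Match a run of bytes in 0xE0-0xFF (lowercase accented letters in Latin-1).
pattern = re.compile(b"[\xe0-\xff]+")
m = pattern.search(raw)
print(m.group().decode("iso-8859-1"))  # prints "àñÿ"
```

The obvious downside is that the database's own query engine can't help with bytes-as-text, which is exactly why I'd prefer a store that understands the encoding natively.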
I understand that NoSQL databases tend to be schema-less, and fields are not normally defined with data types and encodings, but NoSQL will suit our project much better than SQL. Cassandra at least has a 1-byte `ascii` type for the 0–127 lot (though for that range UTF-8 is byte-identical to US-ASCII anyway, so it doesn't help me). Is there any NoSQL database out there that defaults to ISO-8859-1 or ISO-8859-15 for the 0–255 lot?