
Let's say I create a table in Postgres to store language translations. Say I have a table like EVENT that has multiple columns needing translation into other languages. Rather than adding a new column to EVENT for each language, I would just add new rows to LANGUAGE with the same language_id.

EVENT:

| id | EVENT_NAME (fk to LANGUAGE.short) | EVENT_DESCRIPTION (fk to LANGUAGE.long) |
|----|-----------------------------------|-----------------------------------------|
| 0  | 1                                 | 2                                       |

LANGUAGE:

| language_id | language_type | short (character varying(200)) | med (character varying(50)) | long (character varying(2000)) |
|-------------|---------------|--------------------------------|-----------------------------|--------------------------------|
| 1           | english       | game of soccer                 | null                        | null                           |
| 1           | spanish       | partido de footbol             | null                        | null                           |
| 2           | english       | null                           | null                        | A really great game of soccer  |
| 2           | spanish       | null                           | null                        | Un gran partido de footbol     |

If I want the language-specific version, I would create a parameterized statement and pass in the language like this:

```sql
select e.id, name.short, descr.long
from event e, language name, language descr  -- "desc" is a reserved word, hence "descr"
where e.id = 0
  and e.event_name = name.language_id
  and name.language_type = ?
  and e.event_description = descr.language_id
  and descr.language_type = ?
```

My first thought was to have just a single column for the translated text, but I would have to make it big enough to hold any translation. Setting it to 2000 when many records will only be 50 characters seemed wrong. Hence I thought I might add columns with different sizes and use the appropriate one for the data I'm storing (an event name can be restricted to 50 characters on the front end, and a description to 2000).

In the LANGUAGE table, only one of the three columns (short, med, long) will be set per row. This is just my initial thought, and I'm trying to understand whether it's a bad approach. Does the disk still reserve 2,250 characters if I only set the short value? I read a while back that if you do this sort of thing in Oracle, it has to reserve space for all columns in the disk block; otherwise an update would have to allocate it dynamically, which could be slow. Is Postgres the same?

It looks like you can specify a character varying type without a length. Would it be more efficient (space-wise) to define a single column without a size, or a single column with the size specified as 2000?
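For concreteness, the two variants being compared would look something like this (a sketch; table and column names are illustrative, with a single translation column replacing short/med/long):

```sql
-- Option 1: single column, no length modifier (no limit, never padded)
create table language_v1 (
    language_id   integer not null,
    language_type character varying not null,
    translation   character varying          -- no length limit
);

-- Option 2: single column with an explicit limit
create table language_v2 (
    language_id   integer not null,
    language_type character varying not null,
    translation   character varying(2000)    -- length checked on insert/update
);
```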

    A single `text` column is what you need. Read [Character Types](https://www.postgresql.org/docs/current/datatype-character.html) in the docs. – klin Jun 17 '22 at 22:34
  • See also [Difference between text and varchar (character varying)](https://stackoverflow.com/q/4848964/1995738) – klin Jun 17 '22 at 22:36
  • Thanks for the response. I see posts saying that varchar without n and text have the same performance but can be slightly slower than varchar(n). varchar(n) looks like it's stored inline in the disk block. I assume both varchar and text are stored in a separate disk block, hence fetching the record may require two fetches. varchar and text are probably more space efficient, though, since varchar(2000) looks like it will always pad and take up 2000 bytes of space even for small strings. It seems like a choice between space efficiency (text/varchar) and performance (varchar(n)). Does that sound right? – George Jun 17 '22 at 22:55
  • I think a legitimate option would be varchar(2000), assuming the strings will always be less than that size. I think space efficiency is less important than performance for my application. – George Jun 17 '22 at 22:58
  • @George why would you want to prematurely optimize this? The docs even say that in most cases text is the best choice in Postgres. In other databases, too, I think it rarely matters whether you use text or varchar. Just make sure you have an index on the column if you filter by its values. – maraca Jun 17 '22 at 23:00
  • Fair enough. It's not really for me; I have to get the spec past a bunch of senior architects who like to ask tough questions. I like to be prepared. – George Jun 17 '22 at 23:06
  • I do not think that `text` can be slower than `varchar(n)`. Can you cite the source of this statement? – klin Jun 17 '22 at 23:08
  • "VARCHAR is stored inline with the table (at least for the MyISAM storage engine), making it potentially faster when the size is reasonable." https://www.navicat.com/en/company/aboutus/blog/1308-choosing-between-varchar-and-text-in-mysql – George Jun 18 '22 at 13:35
  • I think I'll stop overthinking it and probably go with a single text column. Just to clarify: you're not saying we should always use text, right? There is a valid use for varchar(n) when n is smaller, I think. – George Jun 18 '22 at 13:37
  • @George - The post in the link is about MySQL and does not concern Postgres in any way. – klin Jun 19 '22 at 13:31

2 Answers


Just use a single column of data type text. That will perform just as well as a character varying(n) in PostgreSQL, because the implementation is exactly the same, minus the length check. PostgreSQL only stores as many characters as the string actually has, so there is no overhead in using text even for short strings.

In the words of the documentation:

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column.
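Applied to the question's schema, that means the three sized columns collapse into one (a minimal sketch, assuming one translation per (language_id, language_type) pair, as in the sample data):

```sql
create table language (
    language_id   integer not null,
    language_type text    not null,
    translation   text    not null,  -- holds names and descriptions alike, any length
    primary key (language_id, language_type)
);
```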

Laurenz Albe

Use text.

With Postgres, unless you really need to put a hard limit on the text size (and usually you do not), use text.
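If a hard limit really is required, one way to keep the column as text is a check constraint (a sketch with illustrative names; the named constraint can later be dropped or replaced without changing the column's type):

```sql
create table event_descriptions (
    id          bigserial primary key,
    description text not null
        constraint description_max_length
            check (char_length(description) <= 2000)
);
```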

I see posts that varchar without n and text are the same performance but can be slightly slower than varchar(n).

This is incorrect with PostgreSQL 14. text is the most performant.

There is no performance difference among these three types [char, varchar, text], apart from increased storage space when using the blank-padded type [ie. char], and a few extra CPU cycles to check the length when storing into a length-constrained column [ie. varchar, char].

Storage for varchar and text is the same.

The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
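A quick way to see this in practice is pg_column_size, which reports how many bytes a value occupies:

```sql
-- The same string costs the same whether typed text or varchar(n);
-- varchar(2000) does not pad short values out to 2000 bytes.
select pg_column_size('game of soccer'::text)          as as_text,
       pg_column_size('game of soccer'::varchar(2000)) as as_varchar;
-- Both columns report the same size.
```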


Do you need to reinvent this wheel?

Internationalization and localization tools already exist; they're built into many application frameworks.

Most localization systems work by using a string written in the developer's preferred language (for example, English) and then localizing that string as necessary. Then it becomes a simple hash lookup.

Rather than doing the localization in the database, query the message and then let the application framework localize it. This is simpler, faster, and easier to work with, and it avoids putting additional load on the database.


If you really must do it in the database, here's some notes:

  • `language` is a confusing name for your table; perhaps `messages`, `translations`, or `localizations`.
  • Use standard IETF language tags such as en-US.
  • Put any fixed set of text, such as language tags, into a table for referential integrity.
    • You could use an enum, but they're awkward to read and change.
  • Rather than having short, medium, and long columns, consider having one row for each message and a size column. This avoids a lot of null columns.
    • In fact, don't have "size" at all. It's arbitrary and subjective, as is evidenced by your null columns.
  • Don't prefix columns with the name of the table; fully qualify them as needed.
  • Separate the message from its localization for referential integrity.
```sql
create table languages (
  id serial primary key,
  tag text unique not null
);

-- This table might seem a bit silly, but it allows tables to refer to
-- a single message and enforce referential integrity.
create table messages (
  id bigserial primary key
);

create table message_localizations (
  -- If a message or language is deleted, its localizations go with it.
  message_id bigint not null references messages on delete cascade,
  language_id int not null references languages on delete cascade,

  localization text not null,

  -- This serves as both a primary key and enforces one localization
  -- per message and language.
  primary key(message_id, language_id)
);

create table events (
  id bigserial primary key,
  name_id bigint not null references messages,
  description_id bigint not null references messages
);
```

Then join each message with its localization.

```sql
select
  events.id,
  ml1.localization as name,
  ml2.localization as description
from events
-- left join so there is a result even if there is no localization. YMMV.
left join languages lt on lt.tag = 'en-US'
left join message_localizations ml1
  on ml1.message_id = name_id and ml1.language_id = lt.id
left join message_localizations ml2
  on ml2.message_id = description_id and ml2.language_id = lt.id
```

Demonstration.

But, again, you probably want to use an existing localization tool.

Schwern
  • Thanks for the reply. My tables were just simplified examples; my solution does not really have an events table. I'm aware of services like Google and AWS Translate, but they do a bad job understanding context, especially for single words or shorter phrases. I'm developing purely a REST service. The client needs the ability to manually set the language translation for any language they prefer (so it makes contextual sense). Most searches will use a single language and not return all languages for an "event". I will be using Java locales for standard language naming. – George Jun 18 '22 at 13:53
  • @George "*Im aware of services like google and aws translate but they do a bad job*" You've confused translation with localization. Localization tools take your existing translations and apply them to the text; they do exactly what you're implementing, but faster and better. The localization of text should be done at the application layer *after* the original text has been retrieved from the database. The localizations should not be stored in the database. For an example see [Rails built in localization](https://guides.rubyonrails.org/i18n.html). – Schwern Jun 19 '22 at 18:20