
I've read the following thread and was able to write a conversion script (in C#) that converts all my charset=NONE databases to charset=UTF8. Most of it works great (I still have a few special cases where characters are converted to weird symbols, but that's marginal).

My issue is that I have lots of backup database files (*.fbk) for which I don't know for sure whether they are UTF8 or NONE. Ideally, my code would handle the conversion after the database is restored from the file, depending on the fbk file's character set, so that I only convert when necessary and only after restoring.

Is this at all possible? Or is there a way to define the character set when restoring the database (either via gbak or via the ADO.NET provider)?

neggenbe
  • What exactly is the problem you're trying to solve? A database itself has no character set (other than its default character set, which is only used when creating a new column without an explicit character set). Each and every column has a character set, and each column could have a different one. In short, there is no general way to determine this. – Mark Rotteveel Jul 15 '21 at 11:33
  • There is no one-for-all database charset; actually, every single column has its own. Additionally, when you say `all my charset=NONE databases to charset=UTF8`, that sounds less like a "database charset" and more like a "connection charset", which is a different beast, albeit a related one. You may read more discussion, and some queries too, in the right panel (comments) at https://topanswers.xyz/databases?q=1462 – Arioch 'The Jul 15 '21 at 13:06
  • OK, my default charset was "NONE" everywhere as this is legacy. In the current state, I need `UTF8`, so I have to convert all my databases to contain `UTF8` data. This will simplify SO many things on the client end, as I found out... But having dozens of DB files requires me to automate the procedure... – neggenbe Jul 15 '21 at 13:11
  • You'd better do "data pumping": create metadata-only skeleton databases with character sets set for the columns and for the app's connection, open the two databases (old and new) and copy the tables row by row. You would have to figure out the order of those tables, or sometimes of the rows in those tables, by analyzing dependencies. Tools like IBExpert (and probably others) do have "extract metadata to SQL script" functions; how correct and up to date that function is, only testing would tell, but for simple DBs it works. – Arioch 'The Jul 15 '21 at 13:16
  • `"NONE" everywhere as this is legacy. In current state, I need UTF8` maybe you do not. IF that is legacy i would assume Firebird 2.x or even 1.x. Assuming you had one working client, you probably would have a charset consistently same in every column across all the rows. You would have to detect that charset and then mark the column as such. Granted, if you guessed the charset wrong you can end with "unreadable rows" and "unrestorable backup". But assuming legacy again, you most probably operate all your texts withing some Windows codepage, so it would hopefully work out ok – Arioch 'The Jul 15 '21 at 13:21
  • @Arioch'The this IS what I already did for databases. The question is about backed-up database `.fbk` files (I got them with `gbak`)... So as far as I get it - I should restore these files, convert to `UTF8` using my data pump method, then back up again?! – neggenbe Jul 15 '21 at 15:43
  • You need to query the system tables to get this information, so you need to have a functioning database to make this determination. You cannot do this from the backup file itself (or at least, doing so would require you to parse the file yourself in a similar way to what gbak does when restoring it as a database). So yes, you need to restore the database to be able to do this. Whether you then need to back it up again depends on your process. – Mark Rotteveel Jul 15 '21 at 16:11
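
For reference, here is a minimal sketch of that restore step, assuming the FirebirdSql.Data.FirebirdClient ADO.NET provider and its services API (FbRestore). The class and method names, paths, credentials, and page size are placeholders, and the exact constructor/property signatures may differ slightly between provider versions:

    using FirebirdSql.Data.Services;

    static class FbkRestoreHelper
    {
        // Restores backupFile into databaseFile so that the system tables of the
        // resulting database can be queried afterwards.
        public static void RestoreBackup(string backupFile, string databaseFile)
        {
            var restore = new FbRestore
            {
                // Service connection string: server and credentials are placeholders.
                ConnectionString =
                    $"database={databaseFile};user=SYSDBA;password=masterkey;data source=localhost",
                Verbose = false,
                PageSize = 8192,
                // Create (do not overwrite) the target database.
                Options = FbRestoreFlags.Create
            };
            restore.BackupFiles.Add(new FbBackupFile(backupFile, null));
            restore.Execute();
        }
    }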

1 Answer


In general, a Firebird database does not have a single character set. Each and every column can have its own character set. So the only thing you can do is try and use heuristics.

  1. Use the database default character set. To be clear, the database default character set is only used when creating a new column when no explicit character set is specified. It is entirely possible for a database to have default character set UTF8, while all columns have character set WIN1251!

    You can find the database default character set with the following query:

    select RDB$CHARACTER_SET_NAME from RDB$DATABASE 
    

    NOTE: If the result is NULL, then that means the default character set is NONE.

  2. Count the different character sets of CHAR, VARCHAR and BLOB SUB_TYPE TEXT columns to see which occurs most:

    select 
      coalesce(cs.RDB$CHARACTER_SET_NAME, 'NONE') as CHARSET, 
      count(*) as CHARSET_COUNT
    from RDB$RELATIONS r
    inner join RDB$RELATION_FIELDS rf
      on rf.RDB$RELATION_NAME = r.RDB$RELATION_NAME
    inner join RDB$FIELDS f 
      on f.RDB$FIELD_NAME = rf.RDB$FIELD_SOURCE
    left join RDB$CHARACTER_SETS cs
      on cs.RDB$CHARACTER_SET_ID = f.RDB$CHARACTER_SET_ID 
    where coalesce(r.RDB$SYSTEM_FLAG, 0) = 0
    and r.RDB$VIEW_BLR is null
    and (f.RDB$FIELD_TYPE in (14, 37) or f.RDB$FIELD_TYPE = 261 and f.RDB$FIELD_SUB_TYPE = 1)
    group by 1
    order by 2 desc
    

As an aside, be aware that if clients have used connection character set NONE, then it is entirely possible that the actual character set of contents of a column may not match the defined character set of that column.
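
To tie this back to the C# conversion tool from the question, below is a rough sketch of running both heuristics against a restored database and picking a "probable" character set, assuming the FirebirdSql.Data.FirebirdClient ADO.NET provider. The class name, method name, and the decision rule (prefer the most frequent column character set, fall back to the database default) are only illustrative:

    using System;
    using FirebirdSql.Data.FirebirdClient;

    static class CharsetProbe
    {
        // Returns a "probable" character set for a restored database: the most
        // frequent character set among user text columns, or the database default
        // character set if there are no such columns.
        public static string GuessDatabaseCharset(string connectionString)
        {
            using (var connection = new FbConnection(connectionString))
            {
                connection.Open();

                // Heuristic 1: database default character set (NULL means NONE).
                string defaultCharset;
                using (var cmd = new FbCommand(
                    "select RDB$CHARACTER_SET_NAME from RDB$DATABASE", connection))
                {
                    defaultCharset = (cmd.ExecuteScalar() as string)?.Trim() ?? "NONE";
                }

                // Heuristic 2: most common character set of CHAR, VARCHAR and
                // BLOB SUB_TYPE TEXT columns (query taken from the answer above).
                using (var cmd = new FbCommand(
                    @"select coalesce(cs.RDB$CHARACTER_SET_NAME, 'NONE') as CHARSET,
                             count(*) as CHARSET_COUNT
                      from RDB$RELATIONS r
                      inner join RDB$RELATION_FIELDS rf
                        on rf.RDB$RELATION_NAME = r.RDB$RELATION_NAME
                      inner join RDB$FIELDS f
                        on f.RDB$FIELD_NAME = rf.RDB$FIELD_SOURCE
                      left join RDB$CHARACTER_SETS cs
                        on cs.RDB$CHARACTER_SET_ID = f.RDB$CHARACTER_SET_ID
                      where coalesce(r.RDB$SYSTEM_FLAG, 0) = 0
                      and r.RDB$VIEW_BLR is null
                      and (f.RDB$FIELD_TYPE in (14, 37) or f.RDB$FIELD_TYPE = 261 and f.RDB$FIELD_SUB_TYPE = 1)
                      group by 1
                      order by 2 desc", connection))
                using (var reader = cmd.ExecuteReader())
                {
                    return reader.Read() ? reader.GetString(0).Trim() : defaultCharset;
                }
            }
        }
    }

A caller could then run the conversion only when the result is NONE (or anything other than UTF8), which matches the original goal of converting only when necessary, after the restore.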

Mark Rotteveel