Wednesday, November 20, 2013

Firebird: Changing the database collation

Recently I have faced a challenge. I had a big database (about 3GB of data, hundreds of tables) and I needed to change the collation for all the CHAR and VARCHAR columns. How to do it it? Is there an easy way?

Preface:

The database was created with charset WIN1250 and collation either WIN1250 or PXW_CSY (czech sorting, case-sensitive). I needed to change it to WIN_CZ (czech sorting, case-insensitive).

This article presumes that the character set does not change, all that changes is the collation within that character set. Also this article does not deal with databases using domains or foreign keys on the columns you want to alter. This article deals with computed columns, views, indexes and primary keys though. This procedure was tested on Firebird 2.5.2.

So after this short introduction and warning, let us dive into it.

Default setting for newly created columns:

This is easy, type:

alter character set WIN1250 set default collation WIN_CZ;

And from now on all the new columns will have WIN_CZ collation if not overridden by column definition. Now let us deal with the existing ones.

Check the collation and get the collation_id:

The RDB$COLLATIONS table holds all the collations known to the database. The collation names are unique overall and collation ids are unique within character set. The WIN1250 character set has character_set_id=51.

select * from RDB$COLLATIONS where rdb$character_set_id=51;

Now we see that WIN_CZ has collation_id=7.

Make the list of all the columns that need to be changed:

Next we create a list of all the existing columns we want to change. We will make a VIEW that will hold this list like this:

create view VCOLL as
   select rf.rdb$relation_name as relation_name, 

             rf.rdb$field_name as field_name,
             f.rdb$character_set_id as character_set_id,  

             rf.rdb$collation_id as collation_id, 
             f.rdb$field_type as field_type, 
             f.rdb$field_sub_type as field_sub_type
from RDB$RELATION_FIELDS rf
inner join RDB$FIELDS f on rf.rdb$field_source=f.rdb$field_name
where 

       rf.rdb$system_flag=0 
   and f.rdb$character_set_id=51
   and (f.rdb$field_type = 14 or f.rdb$field_type=37 

         or (f.rdb$field_type=261 and f.rdb$field_sub_type=1));

This view now holds all the user-defined columns in the database that are of type CHAR or VARCHAR or BLOB sub_type TEXT and are defined with character set WIN1250. All these restrictions are security measures. The tables used in the view are described in detail below.

Let us do one more check on this view. If everything is fine, we should not have any column where collation_id is null. You can do any other test of course.

select * from VCOLL where collation_id is null;

Empty result set means OK.


Explanation:

RDB$RELATION_FIELDS table contains all the regular columns. It also contains all the computed columns and all the columns of views. 

Every column in RDB$RELATION_FIELDS is linked to a column in RDB$FIELDS. This link may be on 1:1 basis but also several columns in RDB$RELATION_FIELDS may share one column in RDB$FIELDS table. The column names in RDB$RELATION_FIELDS are user-defined, the columns names in RDB$FIELDS are system-defined (eg. RDB$1, RDB$2, ...).


The character set is stored in RDB$FIELDS only. The collation_id is present in both tables but it seems that it is set in RELATION_FIELDS only, in FIELDS it is always NULL (maybe it is used in domains). This is why we will focus on RDB$RELATION_FIELDS.rdb$collation_id only and we will change this attribute.

When a computed column is created based on a real one, the collation_id is copied from the original column. If computed column changes collation, its collation_id is changed. The same is true for the columns of views. The collation_id of the original column is copied by default. If the collation is changed in the view definition, the collation_id of views column is set accordingly.

This means that we want and we need to change the collation_id for all the columns in RELATION_FIELDS, no matter if they are regular columns, computed columns or columns of a view.

Notes: 
  • user-defined columns have system_flag=0
  • CHAR columns have field_type = 14
  • VARCHAR columns have field_type =37
  • BLOB columns have field_type =261

Make an editable list of table-column pairs:

This step will turn the view we already have into a table. The reason for this step is that we may not want to change all the columns (see below) and this way we can easily remove some columns from the list.

create table XCOLL (relation_name varchar(100), field_name varchar(100));
 

insert into XCOLL 
   select relation_name, field_name from VCOLL 
     where coalesce(collation_id,-1)<>7;

Testing the collation_id means that we remove right now all the columns that have desired collation already.

Check PK columns:

Columns that are a part of PK cannot be changed. We have to indetify them, drop the PK, change the collation and restore the PK.

select * from VCOLL vc where 
   exists 
    (select * from RDB$RELATION_CONSTRAINTS rc, RDB$INDEX_SEGMENTS i 
           where vc.relation_name=rc.rdb$relation_name 
                 and rc.rdb$constraint_type='PRIMARY KEY'
                 and rc.rdb$index_name=i.rdb$index_name 

                 and vc.field_name=i.rdb$field_name);

If there are no columns, you are lucky. If there are some, you have to take care of them one by one. Here is an example of how to do it:

1) Find the constraint name of the PK:

select * from RDB$RELATION_CONSTRAINTS where rdb$relation_name='TABLE_NAME';

2) Get and remember the structure of the PK (note that the index name may not be the same as the constraint name. For user-defined constraint-names the index name is the same. For system-defined constraint-names the index-name is different. Both of them can be seen using the above command):

select * from RDB$INDEX_SEGMENTS where rdb$index_name='PKX_NAME';

3) Drop the constraint (note: use the constraint name, not the index name):

alter table TABLE_NAME drop constraint INTEG_563;

3) Change the collation:

update RDB$RELATION_FIELDS set rdb$collation_id=7 
   where rdb$relation_name='TABLE_NAME' and rdb$field_name='COL_NAME';

4) Recreate the PK:

alter table TABLE_NAME add primary key (COL_NAME);

5) Remove table-column pairs from the XCOLL table:

delete from XCOLL where relation_name='TABLE_NAME' and field_name='COL_NAME';


Check UNIQUE columns:

The procedure is the same as with PK columns. Here is how to identify them:

select * from VCOLL vc where 
   exists 
    (select * from RDB$RELATION_CONSTRAINTS rc, RDB$INDEX_SEGMENTS i 
           where vc.relation_name=rc.rdb$relation_name 
                 and rc.rdb$constraint_type='UNIQUE'
                 and rc.rdb$index_name=i.rdb$index_name 

                 and vc.field_name=i.rdb$field_name);

If you need to change them, the procedure is the same, e.g.:

select * from RDB$RELATION_CONSTRAINTS where rdb$relation_name='TABLE_NAME';

alter table TABLE_NAME drop constraint INTEG_563;

update RDB$RELATION_FIELDS set rdb$collation_id=7 
   where rdb$relation_name='TABLE_NAME' and rdb$field_name='COL_NAME';

alter table TABLE_NAME add unique (COL_NAME);

delete from XCOLL where relation_name='TABLE_NAME' and field_name='COL_NAME';

Get the list of indexes:

The existence of an index does not block the collation update. But when the index is used in a query, it will return the "old" results (according to the old collation setting). This is why we need to identify the indexes touched by the change and rebuild them afterwards. Here is how to find them:

select i.rdb$index_name from RDB$INDICES i
   where exists (select * from RDB$INDEX_SEGMENTS s, XCOLL vc 

    where i.rdb$index_name=s.rdb$index_name 
      and i.rdb$relation_name=vc.relation_name 
      and s.rdb$field_name=vc.field_name);

Put down the names of these indexes. I advice do deactivate them before collation change and reactivate them afterwards. Reactivation rebuilds an index. Note: the above query does not find the expression indexes (computed by). These you have to find by other means (using RDB$DEPENDENCIES), e.g. (untested):

select i.rdb$index_name from RDB$INDICES i
   where exists (select * from RDB$DEPENDENCIES s, XCOLL vc
    where i.rdb$index_name=s.rdb$dependent_name and s.rdb$dependent_type=6
      and i.rdb$relation_name=vc.relation_name
      and i.rdb$relation_name=s.rdb$depended_on_name
      and s.rdb$field_name=vc.field_name);


Note: indexes that are part of PK or UNIQUE constraint cannot be deactivated. This is also why we handled these cases separately above.

Change the collation and rebuild the indexes touched by the change:

alter index IX1 inactive;
alter index IX2 inactive;
...
update RDB$RELATION_FIELDS rf set rdb$collation_id=7 
   where exists 
     (select * from XCOLL x 
        where rf.rdb$relation_name=x.relation_name 
          and rf.rdb$field_name=x.field_name);

alter index IX1 active;
alter index IX2 active;
...

Done.

No comments: