Quick Outline of ErServer Internals
by Andrew Sullivan
23-Oct-2003
Abstract
This is a quick conceptual outline of what erserver does once you have it
working, if you ever get it there. That is, this is not an outline
of how the setup scripts, etc., work (or don't), but how a tuple on
the master ends up on the slave(s).
I. Required support
A. On the master
The master must have the following tables set up:
- _rserv_log_1_
- _rserv_log_2_
- _rserv_sync_
- _rserv_tables_
Any other tables are not strictly required. There is a table that
normally gets set up by the system, _rserv_servers_, but it is never
actually used. It was apparently a stub for some planned features.
There is a server number in _rserv_sync_, and you could therefore
in principle use _rserv_servers_ for convenient lookups, but it won't
get used by anything.
The master must also have a couple of installed sequences:
- _rserv_active_log_id_
- _rserv_old_log_status_
- _rserv_sync_seq_
Again, there is a sequence which exists for _rserv_servers_. It's
not actually used.
The function for the replication log trigger must be created. It's
generally called _rserv_log_, but the important detail is that it is
what is called by the trigger created on each replicated table, and
that it in turn is a C function which is compiled from
src/erserver.c; it should compile to a function called erserver.so,
which will be in the directory where you installed erserver. (In
case you're wondering, by the way, yes this breaks the nice $libdir
trick in 7.2 and later.)
B. On the slave
On the slave, you need exact copies of the tables you are going to
replicate, plus you need _rserv_slave_tables_ and _rserv_slave_sync_.
C. General
All this is set up as a database super user. In fact, you really
only need to be super user to install the function, because it's a C
function. In practice, it's easier to set up the relevant bits as
super user, and then make some limited GRANTs to the relevant tables.
II. Queueing things to be replicated
A table which is to be replicated must meet several conditions on the
master.
A. It must be listed in the table _rserv_tables_:
Column | Type | Modifiers |
tname | name | |
cname | name | |
reloid | oid | |
key | integer | |
The tname is the table name. The cname is the unique, not null key
of the table for replication purposes (the setup script creates
_ers_uniq, but there's nothing requiring that; if you had
single-column primary keys on all your tables, you could use them).
The reloid is the oid in pg_class for that table (i.e. select a.oid
from pg_class a, _rserv_tables_ b where a.relname = b.tname and
b.tname = [target table]), and the key is the attnum from
pg_attribute for that field (i.e. select attnum from pg_attribute a,
_rserv_tables_ b where a.attname = b.cname and a.attrelid = b.reloid
and b.tname = [target table]; note that you'd have to have reloid set
correctly for this to work).
B. Trigger Installed
On the master, it must have the replication trigger installed.
It is normally called _rserv_trigger_t_, but the name is not important.
What _is_ important is that the trigger calls the _rserv_log_ C
function (the actual relevant bit is that the function which is
called is the function, distributed with the erserver source, which
compiles to rserv.so).
C. Permissions
Users writing to a replicated table must have permissions to read
and write on _rserv_log_1_ and _rserv_log_2_, and must be able to select
from _rserv_active_log_id_.
D. Trigger action
Suppose we are replicating target_table, then. We have the
relevant entries and the trigger is set up. When you insert a record
into target_table, you will cause the trigger to fire. It will
insert into the current log table (see below about cleanlog, but if
this is the first time, it should be _rserv_log_1_) a record with the
reloid (see _rserv_tables_), the logid (which is the xid from the
transaction), the timestamp, whether the record is deleted (-1 is an
insert, 0 is an update, 1 a delete), and the value of the key (that
is, the actual value in the key field for that row).
III. Replicating
A. Things on the slave
In order to replicate to a slave, you must have set up
_rserv_slave_tables_ on the slave analogously to how it is set up on
the master. Note that the oids are unlikely to be the same, so you
can't just copy the values from the master. Also, your slave tables
must have _exactly_ the same structure as they do on the master. In
particular, the key field must be available on the slave. Some
versions of the setup scripts apparently have done things in the
wrong order, and the _ers_uniq field is not correctly set up on the
slave. You'll be wanting to fix that. If you don't have the slave
completely set up before the sync starts, you will lose data.
B. Configuration file.
If people ask for an annotation of the config file, I'll provide one.
I think the comments are not too bad in there, though. It should be
relatively obvious if you have a misconfiguration there. (Send mail
to the erserver-general list
if not.)
C. Run the replication engine
You _must_ use the Sun JDK to run this software: apparently there
are bits of it which depend on special Sun bits. Java is only
actually portable if it's written that way, and this isn't. Patches
are welcome. The logs normally end up /path/to/erserver/logs.
D. Worth noting
If you look at the query generated by the replication engine, you
will note that it joins _rserv_log_1_ (or _2_) to the replicated
tables, and replicates the version of the row in the table. That
means that _old versions of the row are not replicated_. This is
efficient, but can surprise you if you're not expecting it.
E. What the engine does
The engine looks at _rserv_sync_ for the master and the current
thread's target slave, and goes through the current log (either
_rserv_log_1_ or _2_, depending on the value in _rserv_active_log_id_
and the state of _rserv_old_log_status_) and gets everything that is
affected _after_ the most recent successful sync for the target
server, and which is not currently active. It joins that result to
the tables in question (using LEFT JOIN so that it gets null rows for
the missing ones), and then applies all the affected rows in the
result set for each table to the slave. All this is done in one
transaction on the master and one on the slave. The master commits
last (which means it is possible that the master crashes without
knowing a slave has been updated, but not that the slave crashes when
the master thinks rows have been committed, but haven't been).
IV. CleanLog
A. What is a CleanLog?
The erserver engine uses more than one log into which it writes in
order to keep track of the rows to be replicated. That could be
inefficient, or it could cause maintenance to have to be done
regularly. Instead, a strategy of log rotation is used.
B. How does it work?
The _rserv_log_ trigger inserts its results into whichever table is
dictated by the current value of _rserv_active_log_id_. So, when it
is time to switch logs, the clean log portion of the program just
selects nextval('_rserv_active_log_id_'). The sequence cycles, so it
returns to 1 after 2.
Of course, immediately after the switch, there is a period in which
old transactions still are using _rserv_log_1_ while _rserv_log_2_ is
filling up. The replication engine knows to use both logs by the
value of _rserv_old_log_status_. Once it has caught up all the
slaves with all the items in _rserv_log_1_, it sets the
_rserv_old_log_status_ back to normal, and then clean log knows it
can proceed with issuing a TRUNCATE on _rserv_log_1_.
The interval between clean log events is controlled in the
configuration file by replic.server.CleanLogInterval. The default is
once every 24 hours. You'll find that is too infrequent on a busy
database.
That's everything I can think of for now. I hope these notes are
useful. Feel free to ask questions.
Andrew Sullivan
andrew at libertyrms.info
Formatting by A. Elein Mustain