pg_comparator - network and time efficient table content comparison.
pg_comparator [options as --help --option --man] conn1 conn2
This script performs a network and time efficient comparison of two possibly large tables on two servers. It makes only sense to use it if the expected differences are relatively small. Optionally, it can synchronize the tables.
The implementation is quite generic: multi-column keys, no assumption of data types other that they can be cast to text, subset of columns can be used for the comparison, handling of NULL values...
Options allow to request help or to adjust some internal parameters. Short one-letter options are also available, usually with the first letter of the option name.
--aggregate=agg or -a aggAggregation function to be used for summaries, 'xor' or 'sum'. Must operate on the result of the checksum function. Default is 'xor'.
--ask-passAsk for passwords interactively.
--assume-size=...Assume the provided value as the table size, thus skipping the COUNT query.
--checksum-function=fun or --cf=fun or -c funChecksum function to use. Must be ck or md5. Default is ck, which is faster, especially if the operation is cpu-bound and the bandwidth is high.
--checksum-size=n or --check-size=n or --cs=n or -z nChecksum size, 2 4 or 8 bytes. Default is 8.
--cleanupDrop checksum and summary tables beforehand.
Useful after a run with --no-temp.
--clearDrop checksum and summary tables explicitely after the computation. Note that they are dropped implicitely by default when the connection is closed because they are temporary.
--expect n or -e nTotal number of differences to expect (updates, deletes and inserts), used for non regression tests.
--folding-factor=8 or -f 8Folding factor: log2 of the number of rows grouped together at each stage. Default chosen after some basic tests on medium-size cases. The minimum value of 1 builds a binary tree. Good values for large databases and low bandwidth is 10 to 12.
--help or -hShow short help.
--man or -mShow manual page.
--max-report=1024Maximum search effort, search will stop if above this threshold: it gives up if a single query at any level involves more than this many differences. Use 0 for no limit.
--max-levels=0Maximum number of levels used. Allows to cut-off folding. Default is 0, meaning no cut-off. Setting a value of 1 only uses the checksum table, without summaries.
--null=howHow to handle NULL values. Either hash to hash all values, where NULL
has one special hash value, or text where NULL values are substituted
by the NULL string. Default is text because it is faster.
--option or -oShow option summary.
--prefix=cmpName prefix for comparison tables. May be schema-qualified.
--report, --no-reportReport keys as they are found. Default is to report.
--separator=: or -s :Separator string or character when concatenating key columns. This character should not appear in the values.
--temporary, --no-temporaryWhether to use temporary tables. Default is to use. If you don't, the tables are kept at the end, so they will have to be deleted by hand.
--threads or -TUse threads: a synonymous for "segmentation fault":-) It seems that DBI or DBD::Pg does not like threads at all... See bug report http://rt.cpan.org/Public/Bug/Display.html.
--statsShow various statistics.
--synchronize or -SActually perform operations to synchronize the second table wrt the first.
Well, not really. It is only done if you add --do-it or -D.
Save your data before attempting anything like that!
--use-key or -uWhether to use the key of the tables to distribute tuples among branches. The key must be simple, integer, not NULL, and evenly distributed. Default is to hash the key, so as to handle any type, composition and distribution.
--transaction, --no-transactionWhether to wrap the whole algorithm in a single transaction. It may be a little quicker in a transaction, so it is the default.
--verboseBe verbose about what is happening. The more you ask, the more verbose.
--versionShow version information.
--where=...SQL boolean condition on table tuples for partial comparison. Useful to reduce the load if you know that expected differences are in some particular parts of your data, say those timestamped today...
The two arguments describe database connections with the following URL-like
syntax, where square brackets denote optional parts. Many parts are optional
with a default. The minimum syntactically correct specification is /.
See also the EXAMPLES section bellow.
[driver://][login[:pass]@][host[:port]]/[base/[[schema.]table[?key:cols]]]
Database driver to use. Use pgsql for PostgreSQL, and mysql for MySQL. Heterogeneous databases may be compared and synchronized, however beware that subtle typing, encoding and casting issues may prevent heterogeneous comparisons or synchronizations to succeed. Default is pgsql.
Login to use when connecting to database. Default is username for first connection, and same as first connection for second.
Password to use when connecting to database. Note that it is a bad idea to put a password as an command argument. Default is none for the first connection, and the same password as the first connection for the second if the connection targets the same host, port and uses the same login. See also --ask-pass option.
Hostname to connect to. Default is localhost.
Tcp-ip port to connect to. Default is 5432 for PostgreSQL and 3306 for MySQL.
Database catalog to connect to. Default is username for first connection. Default is same as first connection for second connection.
The possibly schema-qualified table to use for comparison. No default for first connection. Default is same as first connection for second connection.
Comma-separated list of key columns. No default for first connection. Default is same as first connection for second connection.
Comma-separated list of columns to compare. May be empty. No default for first connection. Default is same as first connection for second connection.
The output of the command consists of lines describing the differences found between the two tables. They are expressed in term of insertions, updates or deletes and of tuple keys.
Key k tuple is updated from table 1 to table 2. It exists in both tables with different values.
Key k tuple does not appear in table 2, but only in table 1. It must be inserted in table 2 to synchronize it wrt table 1.
Key k tuple appears in table 2, but not in table 1. It must be deleted from 2 to synchronize it wrt table 1.
In case of key-checksum or data-checksum collision, false positive or false negative results may occur. Changing the checksum function would help in such cases.
Three support functions are needed on the database:
The COALESCE function takes care of NULL values in columns.
A checksum function must be used to reduce and distribute key and columns values. It may be changed with the --checksum option. Its size can be selected with the --checksize option (2, 4 or 8 bytes).
Suitable implementations are available for PostgreSQL and can be loaded into
the server by processing share/contrib/checksum.sql. New checksums and
casts are also available for MySQL, see mysql_*.sql.
An aggregate function is used to summarize checksums for a range of rows. It must operate on the result of the checksum function. It may be changed with the --aggregate option.
Suitable implementations of a exclusive-or xor aggregate are available
for PostgreSQL and can be loaded into the server by processing
share/contrib/xor_aggregate.sql.
Moreover several perl modules are useful to run this script:
Getopt::Long for option management.
DBI,
DBD::Pg to connect to PostgreSQL,
and DBD::mysql to connect to MySQL.
Term::ReadPassword for ask-pass option.
Pod::Usage for doc self-extraction (--man --opt --help).
Compare tables calvin and hobbes in database family on localhost, with key id and columns c1 and c2:
./pg_comparator /family/calvin?id:c1,c2 /family/hobbes
Compare tables calvin in default database on localhost and the same table in default database on sablons, with key id and column data:
./pg_comparator localhost/family/calvin?id:data sablons/
The aim of the algorithm is to compare the content of two tables, possibly on different remote servers, with minimum network traffic. It is performed in three phases.
A checksum table is computed on each side for the target table.
A fist level summary table is computed on each side by aggregating chunks of the checksum table. Other levels of summary aggregations are then performed till there is only one row in the last table, which then stores a global checksum for the whole initial target tables.
Starting from the upper summary tables, aggregated checksums are compared from both sides to look for differences, down to the initial checksum table. Keys of differing tuples are displayed.
The first phase computes the initial cheksum table t(0) on each side. Assuming that key is the table key columns, and cols is the table data columns that are to be checked for differences, then it is performed by querying target table t as follow:
CREATE TABLE t(0) AS SELECT key AS id, checksum(key) AS idc, checksum(key || cols) AS cks FROM t;
The inititial key is kept, as it will be used to show differing keys at the end. The rational for the idc column is to randomize the key-values distribution so as to balance aggrates in the next phase. The key must appear in the cheksum also, otherwise content exchanged between two keys would not be detected in some cases.
Now we compute a set of cascading summary tables by grouping f (folding factor) checksums together at each stage. The grouping is based on a mask on the idc column to take advantage of the checksum randomization. Starting from p=0 we build:
CREATE TABLE t(p+1) AS SELECT idc & mask(p+1) AS idc, XOR(cks) FROM t(p) GROUP BY idc & mask(p+1);
The mask(p) is defined so that it groups together on average f
checksums together: mask(0) = ceil2(size); mask(p) = mask(p-1)/f;
This leads to a hierarchy of tables, each one being a smaller summary
of the previous one:
checksum table, size rows, i.e. as many rows as the target table.
first summary table, (size/f) rows.
intermediate summary table, (size/f**p) rows.
one before last summary table, less than f rows.
last summary table, mask is 0, 1 row.
It is important that the very same masks are used so that aggregations are the same, allowing to compare matching contents on both sides.
After all these support tables are built on both sides comes the search for differences. When checking the checksum summary of the last tables (level n) with only one row, it is basically a comparison of the cheksum of the whole table contents. If they match, then both tables are equal, and we are done. Otherwise, if these checksums differ, some investigation is needed to detect offending keys.
The investigation is performed by going down the table hierarchy and looking for all idc for which there was a difference in the checksum on the previous level. The same query is performed on both side at each stage:
SELECT idc, cks FROM t(p) WHERE idc & mask(p+1) IN (idc-with-diff-checksums-from-level-p+1) ORDER BY idc, cks;
And the results from both sides are merged together. When doing the merge procedure, four cases can arise:
Both idc and cks match. Then there is no difference.
Although idc does match, cks does not. Then this idc is to be investigated at the next level, as the checksum summary differs. If we are already at the last level, then the offending key can be shown.
No idc match, one supplemental idc in the first side. Then this idc correspond to key(s) that must be inserted for syncing the second table wrt the first.
No idc match, one supplemental idc in the second side. Then this idc correspond to key(s) that must be deleted for syncing the second table wrt the first.
Cases 3 and 4 are simply symmetrical, and it is only an interpretation to decide whether it is an insert or a delete, taking the first side as the reference.
The checksum implementation gives integers, which are constant length and easy to manipulate afterwards.
The xor aggregate is a good choice because there is no overflow issue with it and it takes into account all bits of the input.
NULL values must be taken care appropriatelly.
The folding factor and all modules are taken as power of two...
There is a special management of large chunks of deletes or inserts which is implemented although not detailed in the algorithmic overview above nor the complexity analysis below.
Let n be the number of rows, r the row size, f the folding factor and k the number of differences to be detected. Then ISTM that:
is better than k*f*ceil(log(n)/log(f)): it is independent of r, the lower f the better, and you want k<<n.
maximum is 6+2*ceil(log(n)/log(f)), minimum is 6+ceil(log(n)/log(f)) for equal tables.
is about n*r+n*ln(n)*(f/(f-1)). Here a not too small f is better, as it reduces both the number of requests and of disk I/Os;
The choice of f is indeed a tradeoff.
This script and algorithm was somehow inspired by:
Taming the Distributed Database Problem: A Case Study Using MySQL by Giuseppe Maxia in Sys Admin vol 13 num 8, Aug 2004, pp 29-40. See http://www.perlmonks.org/index.pl for details.
In the above paper, three algorithms are presented. The first one compares two tables with a checksum technique. The second one finds UPDATE or INSERT differences based on a 2-level (checksum and summary) table hierarchy. The algorithm is asymmetrical, as different queries are performed on the two tables to be compared. It seems that the network traffic volume is in k*(f+(n/f)+r), that it has a probabilistically-buggy merge procedure, and that it makes assumptions about the distribution of key values. The third algorithm looks for DELETE differences based on counting, with the implicit assumption that there are only such differences.
The algorithm used here implements all three tasks. It is fully symmetrical. It finds UPDATE, DELETE and INSERT between the two tables. The checksum and summary hierarchical level idea is reused and generalized so as to reduce the algorithmic complexity.
From the implementation standpoint, the script is as parametric as possible thru many options, and makes as few assumptions as possible about table structures, types and values.
Michael Nacos made a robust implementation http://pgdba.net/pg51g/ based on triggers. He also noted that although database contents are compared by the algorithm, the database structure differences can also be detected by comparing system tables which describe these.
Benjamin Mead Vandiver's PhD Thesis Detecting and Tolerating Byzantine Faults in Database Systems, Massachusset's Institute of Technology, May 2008 (report number MIT-CSAIL-TR-2008-040). There is an interesting discussion in Chapter 7, and experiments are presented on a Java/JDBC/MySQL implementation of several algorithms.
Some products or projects implement such features, for instance: http://code.google.com/p/maatkit/ (mk-table-sync, by Baron Schwartz, see http://tinyurl.com/mysql-data-diff-algorithm) http://www.programurl.com/software/sql-server-comparison.htm http://www.dbbalance.com/db_comparison.htm http://www.dkgas.com/dbdiff.htm http://www.sql-server-tool.com/ http://sourceforge.net/projects/mysqltoolkit http://www.citrustechnology.com/solutions/data-comparison http://www.dbsolo.com/datacomp.html http://comparezilla.sourceforge.net/ http://www.webyog.com/
All softwares have bugs. This is a software, hence it has bugs.
Reporting bugs is good practice, so tell me if you find one!
The implementation does not do many sanity checks. For instance, it does not check that the declared key is indeed a key.
Do not attempt to synchronize while the table is being used! Maybe I should lock the table?
Although the algorithm can work with some normalized columns (say strings are trimmed, lowercased, unicode normalized...), the implementation may not work at all.
Tables with binary keys or with NULL in keys may not work.
See http://pgfoundry.org/projects/pg-comparator/ for the latest version. My web site for the tool is http://www.coelho.net/pg_comparator/.
Add more functions (MD5, SUM) and sizes (2, 4, 8). Remove template parametrization which is much too fragile to expose. Add a wrapping transaction which may speed up things a little. Implementation for MySQL, including synchronizing heterogeneous databases. Improved documentation. Extensive validation/non regression tests.
More documentation. Improved connection parsing with more sensible defaults. Make the mask computation match its above documentation with a bottom-up derivation, instead of a simpler top-down formula which results in bad performances when a power of the factor is close to the size (as pointed out in Benjamin Mead Vandiver's PhD). This bad mask computation was introduced somehow between 1.3 and 1.4 as an attempt at simplifying the code.
More documentation.
Add --expect option for non regression tests.
Add more links. Fix so that with a key only (i.e. without additional columns), although it could be optimized further in this case. Integrate patch by Erik Aronesty: More friendly "connection parsor". Add synchronization option to actually synchronize the data.
Manual connection string parsing.
Grumble! wrong tar pushed out.
Minor makefile fix asked for by Roberto C. Sanchez.
Minor fix for PostgreSQL 8.3 by Roberto C. Sanchez.
Port to PostgreSQL 8.2. Better documentation. Fix mask bug: although the returned answer was correct, the table folding was not. DELETE/INSERT messages exchanged so as to match a 'sync' or 'copy' semantics, as suggested by Erik Aronesty.
Project moved to http://pgfoundry.org/. Use cksum8 checksum function by default. Minor doc updates.
Added --show-all-keys option for handling big chunks of deletes or inserts.
Fix algorithmic bug: checksums must also include the key, otherwise exchanged data could be not detected if the keys were to be grouped together.
Algorithmic section added to manual page. Thanks to Giuseppe Maxia who asked for it.
Various code cleanups.
Initial revision.
Copyright (c) 2004-2010, Fabien Coelho <pg dot comparator at coelho dot net> http://www.coelho.net/
This softwere is distributed under the terms of the BSD Licence. Basically, you can do whatever you want, but you have to keep the license... and I'm not responsible for any consequences. Beware, you may lose your data or your hairs because of this software! See the LICENSE file enclosed with the distribution for details.
If you are very happy with this software, I would appreciate a postcard saying so (see my webpage for current address).