
I currently have a PostgreSQL database, because one of the pieces of software we're using only supports this particular database engine. I then have a query which summarizes and splits the data from the app into a more useful format.

In my MySQL database, I have a table with a schema identical to the output of the query described above.

What I would like to develop is an hourly cron job which will run the query against the PostgreSQL database, then insert the results into the MySQL database. During the hour period, I don't expect to ever see more than 10,000 new rows (and that's a stretch) which would need to be transferred.

Both databases are on separate physical servers, continents apart from one another. The MySQL instance runs on Amazon RDS - so we don't have a lot of control over the machine itself. The PostgreSQL instance runs on a VM on one of our servers, giving us complete control.

The duplication is, unfortunately, necessary because the PostgreSQL database only acts as a collector for the information, while the MySQL database has an application running on it which needs the data. For simplicity, we're wanting to do the move/merge and delete from PostgreSQL hourly to keep things clean.

To be clear - I'm a network/sysadmin guy - not a DBA. I don't really understand all of the intricacies necessary in converting one format to the other. What I do know is that the data being transferred consists of 1xVARCHAR, 1xDATETIME and 6xBIGINT columns.

The closest guess I have for an approach is to use some scripting language to make the query, convert results into an internal data structure, then split it back out to MySQL again.

In doing so, are there any particular good or bad practices I should be wary of when writing the script? Or any documentation I should look at which might be useful for doing this kind of conversion? I've found plenty of examples of scheduling jobs which look very manageable and well-documented, but the ongoing nature of this script (an hourly run) seems less common and/or less documented.

Open to any suggestions.

bdx
  • What is the reason to duplicate the data? Can't you just use PostgreSQL for the stuff you do with MySQL? –  Sep 15 '12 at 12:00
  • A scripting language is an unnecessary dependency. Just output it to a text file and import from there. – Clodoaldo Neto Sep 15 '12 at 13:16
  • There are 100s of ways to get this done. I think to get a good answer, you are going to have to give a bit more information. Like what environment are you using? Are the servers on the same box, etc. If it was me, I would just set up a cron job to run a python script every hour. But, I'm familiar with Linux and am a Python developer. – David S Sep 16 '12 at 00:53
  • Fair points from everyone, I'll edit my question at the bottom to provide clarifications. – bdx Sep 17 '12 at 01:01
  • Both postgresql and mysql have appropriate tools to import/export from/to text files. The postgresql one is [COPY](http://www.postgresql.org/docs/current/static/sql-copy.html) and the mysql's is [mysqlimport](http://dev.mysql.com/doc/refman/5.5/en/mysqlimport.html). Other than special situations I guess it will be just a matter of getting the syntax right. – Clodoaldo Neto Sep 17 '12 at 11:41

2 Answers


Use the same database system on both ends and use replication

If your remote end was also PostgreSQL, you could use streaming replication with hot standby to keep the remote end in sync with the local one transparently and automatically.

If the local end and remote end were both MySQL, you could do something similar using MySQL's various replication features like binlog replication.

Sync using an external script

There's nothing wrong with using an external script. In fact, even if you use DBI-Link or similar (see below), you'll probably still have to use an external script (or psql) from a cron job to initiate replication, unless you're going to use PgAgent to do it.

Either accumulate rows in a queue table maintained by a trigger procedure, or make sure you can write a query that always reliably selects only the new rows. Then connect to the target database and INSERT the new rows.

If the rows to be copied are too big to comfortably fit in memory, you can use a cursor and read them with FETCH.

I'd do the work in this order (a rough Python sketch follows the list):

  • Connect to PostgreSQL
  • Connect to MySQL
  • Begin a PostgreSQL transaction
  • Begin a MySQL transaction. If your MySQL is using MyISAM, go and fix it now.
  • Read the rows from PostgreSQL, possibly via a cursor or with DELETE FROM queue_table RETURNING *
  • Insert them into MySQL
  • DELETE any rows from the queue table in PostgreSQL if you haven't already.
  • COMMIT the MySQL transaction.
  • If the MySQL COMMIT succeeded, COMMIT the PostgreSQL transaction. If it failed, ROLLBACK the PostgreSQL transaction and try the whole thing again.
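
Here's a minimal sketch of that sequence in Python, assuming psycopg2 and PyMySQL as the drivers. Every name in it (the_queue_table, mysql_table, the connection strings and credentials) is a placeholder rather than anything from your actual setup:

# Hourly sync sketch: move queued rows from PostgreSQL into MySQL.
# All names below are placeholders.
import psycopg2   # PostgreSQL driver
import pymysql    # MySQL driver

pg = psycopg2.connect("dbname=collector user=sync host=localhost")
my = pymysql.connect(host="your-instance.rds.amazonaws.com",
                     user="sync", password="secret", database="appdb")

try:
    pg_cur = pg.cursor()
    my_cur = my.cursor()

    # Pull and remove the queued rows in one statement; nothing is final
    # until the PostgreSQL transaction commits below.
    pg_cur.execute("DELETE FROM the_queue_table RETURNING *")
    rows = pg_cur.fetchall()

    if rows:
        placeholders = ", ".join(["%s"] * len(rows[0]))
        my_cur.executemany(
            "INSERT INTO mysql_table VALUES (" + placeholders + ")", rows)

    my.commit()   # MySQL first ...
    pg.commit()   # ... then PostgreSQL, so rows only leave the queue once
                  # they are safely stored on the MySQL side
except Exception:
    my.rollback()
    pg.rollback()  # the rows stay queued and are picked up on the next run
    raise
finally:
    my.close()
    pg.close()

Run it hourly from cron, e.g. 0 * * * * /usr/bin/python /opt/sync/pg_to_mysql.py (the path is just an example); wrapping the command in flock -n keeps two runs from overlapping if one hour's transfer runs long.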

The PostgreSQL COMMIT is incredibly unlikely to fail because it's a local database, but if you need perfect reliability you can use two-phase commit on the PostgreSQL side, where you:

  • PREPARE TRANSACTION in PostgreSQL
  • COMMIT in MySQL
  • then either COMMIT PREPARED or ROLLBACK PREPARED in PostgreSQL depending on the outcome of the MySQL commit.

This is likely too complicated for your needs, but is the only way to be totally sure the change happens on both databases or neither, never just one.
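
If you do go that route, psycopg2 exposes two-phase commit through the standard tpc_* methods, which issue PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED for you. A rough sketch, assuming the same pg and my connections as in the script above, and noting that the PostgreSQL server must have max_prepared_transactions set to a non-zero value:

# Two-phase-commit variant of the commit step (sketch only).
xid = pg.xid(0, "hourly-sync", "pg-to-mysql")   # arbitrary transaction label

pg.tpc_begin(xid)
# ... run the DELETE FROM the_queue_table RETURNING * and the MySQL
# INSERTs here, exactly as in the previous sketch ...
pg.tpc_prepare()            # PREPARE TRANSACTION on the PostgreSQL side

try:
    my.commit()             # COMMIT in MySQL
    pg.tpc_commit()         # COMMIT PREPARED
except Exception:
    my.rollback()
    pg.tpc_rollback()       # ROLLBACK PREPARED; the rows stay queued
    raise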

BTW, seriously, if your MySQL is using MyISAM table storage, you should probably remedy that. It's vulnerable to data loss on crash, and it can't be transactionally updated. Convert to InnoDB.

Use DBI-Link in PostgreSQL

Maybe it's because I'm comfortable with PostgreSQL, but I'd do this using a PostgreSQL function that used DBI-link via PL/Perlu to do the job.

When replication should take place, I'd run a PL/PgSQL or PL/Perl procedure that uses DBI-Link to connect to the MySQL database and insert the data from the queue table.

Many examples exist for DBI-Link, so I won't repeat them here. This is a common use case.

Use a trigger to queue changes and DBI-link to sync

If you only want to copy new rows and your table is append-only, you could write a trigger procedure that appends all newly INSERTed rows into a separate queue table with the same definition as the main table. When you want to sync, your sync procedure can then in a single transaction LOCK TABLE the_queue_table IN EXCLUSIVE MODE;, copy the data, and DELETE FROM the_queue_table;. This guarantees that no rows will be lost, though it only works for INSERT-only tables. Handling UPDATE and DELETE on the target table is possible, but much more complicated.
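
As a rough illustration of that queue, the one-time setup might look like this. All the names (main_table, the_queue_table, the trigger function) are invented, and it's written as a small psycopg2 script only to keep these sketches in one language; the DDL could just as well be pasted into psql:

# One-time setup sketch for the queue-table approach; adjust the names
# and connection string to your own schema.
import psycopg2

DDL = """
CREATE TABLE the_queue_table (LIKE main_table INCLUDING DEFAULTS);

CREATE OR REPLACE FUNCTION queue_new_row() RETURNS trigger AS $$
BEGIN
    INSERT INTO the_queue_table VALUES (NEW.*);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER queue_insert
AFTER INSERT ON main_table
FOR EACH ROW EXECUTE PROCEDURE queue_new_row();
"""

conn = psycopg2.connect("dbname=collector user=sync host=localhost")
cur = conn.cursor()
cur.execute(DDL)   # all three statements run in one transaction
conn.commit()
conn.close()

The hourly job can then either take the LOCK TABLE ... EXCLUSIVE route described above, or simply use DELETE FROM the_queue_table RETURNING * as in the earlier script sketch.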

Add MySQL to PostgreSQL with a foreign data wrapper

Alternatively, for PostgreSQL 9.1 and above, I might consider using the MySQL Foreign Data Wrapper, ODBC FDW or JDBC FDW to allow PostgreSQL to see the remote MySQL table as if it were a local table. Then I could just use a writable CTE to copy the data.

WITH moved_rows AS (
    DELETE FROM queue_table RETURNING *
)
INSERT INTO mysql_table
SELECT * FROM moved_rows;
Craig Ringer
  • I like the idea of using a foreign data wrapper to have the MySQL table appear as if it was local to the postgres database. Are you sure you're able to insert data into foreign tables though? Going by this information page http://www.postgresql.org/docs/9.1/static/ddl-foreign-data.html it seems like you are only able to currently select from foreign tables, not insert into them. Do you know if this is definitely the case? – bdx Sep 18 '12 at 04:39
  • And in regards to the MySQL table storage, we're already running InnoDB. – bdx Sep 18 '12 at 05:04
  • @bdx Good point, I don't know if writing to FDWs is currently supported. I should've checked that. You might be stuck with DBI-Link or an external script for now. – Craig Ringer Sep 18 '12 at 05:18
  • Ended up going the route of external script, your answer saved me a lot of time and gave me a much better understanding, thanks for the effort on this one. – bdx Sep 24 '12 at 06:42
  • With PostgreSQL 9.3 you can insert into foreign tables. – kgilpin Sep 11 '14 at 15:12

In short, you have two scenarios:

1) Make the destination pull the data from the source into its own structure

2) Make the source push the data out of its structure to the destination

I'd rather try the second one: look around and find a way to create a PostgreSQL trigger, some special "virtual" table, or maybe a PL/pgSQL function. Then, instead of an external script, you'll be able to execute the procedure by running a query from cron, or possibly from inside PostgreSQL itself, since there are some options for scheduling operations there. I'd choose the second scenario because PostgreSQL is much more flexible, and when it comes to manipulating data in special, DIY ways, you will simply have more possibilities.

An external script probably isn't a good solution, e.g. because you will need to treat binary data with special care, or convert dates and times from DATE to VARCHAR and then back to DATE again. Inside an external script, various text-stored values will probably just be strings, and you will need to quote them as well.

Piotr Wadas
  • One more thing: doing it from inside PostgreSQL has another advantage. As you say, applications are putting data into the database all the time, so it's better to push the data out from the side which, at least theoretically, knows what is happening to the data at the time it comes to synchronize/push it out. – Piotr Wadas Sep 15 '12 at 13:34
  • Can you link to any documentation that explains how I might do this from within PostgreSQL? I've had a bit of a search around and I can't see how I'd establish a database connection to a completely separate database engine from within it. – bdx Sep 17 '12 at 01:00
  • Obviously you're going to need PL/pgSQL for that. Here you'll find some information about task scheduling: http://www.postgresonline.com/journal/archives/19-Setting-up-PgAgent-and-Doing-Scheduled-Backups.html , and here about PL/Perl functions: http://www.postgresql.org/docs/9.2/interactive/plperl-funcs.html – Piotr Wadas Sep 17 '12 at 07:18
  • Tip: you can use external Perl modules to connect to other databases when you create an "untrusted" plperl function (which actually means it has access to external files). However, as Clodoaldo suggested, this will require a little bit of research and work, and will be quite complicated, so it's up to you how much time you can spend and how important the task is. – Piotr Wadas Sep 17 '12 at 07:22