Here is my solution using SQLAlchemy. This is a long, blog-like post; I hope it is acceptable here and useful to someone.
This may also work with other combinations of source and target databases (besides MS SQL Server and PostgreSQL, respectively), although I have not tested them.
Workflow (sort of TL;DR)
- Inspect the source automatically and deduce the existing table models (this is called reflection).
- Import previously defined table models which will be used to create the new tables in the target.
- Iterate over the table models (the ones existing in both source and target).
- For each table, fetch chunks of rows from source and insert them into target.
Requirements
The steps below rely on SQLAlchemy, GeoAlchemy2 and sqlacodegen, plus a DBAPI driver for each database (e.g., pyodbc for SQL Server and psycopg2 for PostgreSQL).
Detailed steps
1. Connect to the databases
In SQLAlchemy, the object that handles the connection between the application and the actual database is called an engine. So, to connect to the databases, an engine must be created with the corresponding connection string. The typical form of a database URL is:
dialect+driver://username:password@host:port/database
You can see some examples of connection URLs in the SQLAlchemy documentation.
Once created, the engine will not establish a connection until it is explicitly told to do so, either through the .connect() method or when an operation that depends on a connection is invoked (e.g., .execute()).
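As a minimal sketch of creating the two engines (the connection URLs, drivers and credentials below are placeholders to adapt to your setup; the variable names are the ones used in the rest of the post):
from sqlalchemy import create_engine

# Placeholder URLs; adjust driver, credentials, host and database names.
# The snippets below refer to these engines as ms_sql/source_engine (source)
# and postgres/postgres_engine (target).
source_engine = ms_sql = create_engine('mssql+pyodbc://user:password@host/SourceDB?driver=SQL+Server')
postgres_engine = postgres = create_engine('postgresql+psycopg2://user:password@host/TargetDB')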
con = ms_sql.connect()
2. Define and create tables
2.1 Source database
Tables on the source side are already defined, so we can use table reflection:
from sqlalchemy import MetaData
metadata = MetaData()
metadata.reflect(bind=source_engine)
You may see some warnings if you try this. For example,
SAWarning: Did not recognize type 'geometry' of column 'Shape'
That is because SQLAlchemy does not recognize custom types automatically. In my specific case, this was because of an ArcSDE type. However, this is not problematic when you only need to read data. Just ignore those warnings.
After the table reflection, you can access the existing tables through that metadata object.
# see all the tables names
print list(metadata.tables)
# handle the table named 'Troco'
src_table = metadata.tables['Troco']
# see that table columns
print src_table.c
2.2 Target database
For the target, because we are starting a new database, it is not possible to use table reflection. However, it is not complicated to create the table models with SQLAlchemy; in fact, it might be even simpler than writing plain SQL.
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from geoalchemy2 import Geometry

Base = declarative_base()

class SomeClass(Base):
    __tablename__ = 'some_table'
    id = Column(Integer, primary_key=True)
    name = Column(String(50))
    Shape = Column(Geometry('MULTIPOLYGON', srid=102165))
In this example there is a column with spatial data (defined here thanks to GeoAlchemy2).
Now, if you have tens of tables, defining so many models may be daunting, tedious, or error prone. Luckily, there is sqlacodegen, a tool that reads the structure of an existing database and generates the corresponding SQLAlchemy model code. Example:
pip install sqlacodegen
sqlacodegen mssql:///some_local_db --outfile models.py
Because the purpose here is just to migrate the data, and not the schema, you can generate the models from the source database and then adapt/correct the generated code for the target database.
Note: it will generate a mix of class models and Table models (tables without a primary key, and association tables, are rendered as plain Table objects rather than classes). See the sqlacodegen documentation for details on this behavior.
Again, you will see similar warnings about unrecognized custom data types. That is one of the reasons why we now have to edit the models.py file and adjust the models. Here are some hints on things to adjust (a small illustrative sketch follows the list):
- Columns with custom data types are defined with NullType. Replace them with the proper type, for instance, GeoAlchemy2's Geometry. When defining Geometry columns, pass the correct geometry type (linestring, multilinestring, polygon, etc.) and the SRID.
- PostgreSQL character types are variable-length capable, and SQLAlchemy will map String columns to them by default, so we can replace all Unicode and String(...) with String. Note that it is not required, nor advisable (don't quote me on this), to specify the number of characters in String; just omit it.
- You will have to double-check, but probably all BIT columns are in fact Boolean.
- Most numeric types (e.g., Float(...), Numeric(...)) can, like the character types, be simplified to Numeric. Be careful with exceptions and/or specific cases.
- I noticed some issues with columns defined as indexes (index=True). In my case, because the schema will be migrated, these should not be required now and could be safely removed.
- Make sure the table and column names are the same in both databases (reflected tables and defined models); this is a requirement for a later step.
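For illustration, here is a hypothetical model adjusted along the lines of these hints; the column names and the original types noted in the comments are made up, not taken from the actual schema:
from sqlalchemy import Boolean, Column, Integer, Numeric, String
from geoalchemy2 import Geometry

class SomeTable(Base):  # Base as defined earlier
    __tablename__ = 'SomeTable'
    OBJECTID = Column(Integer, primary_key=True)
    Nome = Column(String)       # was Unicode(100), index=True
    Ativo = Column(Boolean)     # was BIT
    Extensao = Column(Numeric)  # was Float(53)
    Shape = Column(Geometry('MULTILINESTRING', srid=102165))  # was NullType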
Now we can connect the models and the database together, and create all the tables on the target side.
Base.metadata.bind = postgres
Base.metadata.create_all()
Notice that, by default, .create_all() will not touch existing tables. In case you want to recreate or insert data into an existing table, it is required to DROP it beforehand:
Base.metadata.drop_all()
3. Get data
Now you are ready to copy data from one side and, later, paste it into the other. Basically, you just need to issue a SELECT query for each table. This is possible and easy to do over the layer of abstraction provided by the SQLAlchemy ORM:
data = ms_sql.execute(metadata.tables['TableName'].select()).fetchall()
However, this is not enough; you will need a bit more control. The reason for that is related to ArcSDE. Because it uses a proprietary format, you can retrieve the data but you cannot parse it correctly. You would get something like this:
(1, Decimal('0'), u' ', bytearray(b'\x01\x02\x00\x00\x00\x02\x00\x00\x00@\xb1\xbf\xec/\xf8\xf4\xc0\x80\nF%\x99(\xf9\xc0@\xe3\xa5\x9b\x94\xf6\xf4\xc0\x806\xab>\xc5%\xf9\xc0'))
The workaround here was to convert the geometry column to the Well-Known Text (WKT) format. This conversion has to take place on the database side; ArcSDE is there, so it knows how to convert it. So, for example, suppose that in TableName there is a column with spatial data called shape. The required SQL statement should look like this:
SELECT [TableName].[shape].STAsText() FROM [TableName]
This uses .STAsText(), a geometry data type method of SQL Server.
If you are not working with ArcSDE, the following steps are not required (a small sketch follows this list):
- iterate over the tables (only those that are defined in both the source and the target),
- for each table, look for a geometry column (list them beforehand),
- build a SQL statement like the one above.
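A minimal sketch of how such statements could be assembled (the geom_columns mapping and the build_select() helper are hypothetical, not part of the original code):
# Hypothetical mapping: table name -> name of its geometry column (if any).
geom_columns = {'TableName': 'shape'}

def build_select(table):
    # Select every column; convert the geometry column, if present, to WKT on the server side.
    geom = geom_columns.get(table.name)
    parts = []
    for col in table.c:
        if geom is not None and col.name.lower() == geom.lower():
            parts.append('[{0}].[{1}].STAsText() AS [{1}]'.format(table.name, col.name))
        else:
            parts.append('[{0}]'.format(col.name))
    return 'SELECT {0} FROM [{1}]'.format(', '.join(parts), table.name)

statement = build_select(metadata.tables['TableName'])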
Once a statement is built, SQLAlchemy can execute it.
result = ms_sql.execute(statement)
In fact, this does not actually get the data (compare with the ORM example: notice the missing .fetchall() call). To explain, here is a quote from the SQLAlchemy docs:
The returned result is an instance of ResultProxy, which references a DBAPI cursor and provides a largely compatible interface with that of the DBAPI cursor. The DBAPI cursor will be closed by the ResultProxy when all of its result rows (if any) are exhausted.
The data will only be retrieved just before it is inserted.
4. Insert data
Connections are established, tables are created, the data has been prepared; now let's insert it. Similarly to getting the data, SQLAlchemy also allows you to INSERT data into a given table through its ORM:
postgres_engine.execute(Base.metadata.tables['TableName'].insert(), data)
Again, this is easy, but because of non-standard formats and erroneous data, further manipulation will probably be required.
4.1 Matching columns
First, there were some issues with matching the source columns with the target columns (of the same table); perhaps this was related to the Geometry column. A possible solution is to create a Python dictionary for each row, which maps each target column name (key) to the corresponding value from the source row.
This is performed row by row, although it is not as slow as one would guess, because the actual insertion still happens several rows at a time. So, there will be one dictionary per row, and, instead of inserting the data object (which is a list of tuples, where one tuple corresponds to one row), you will be inserting a list of dictionaries.
Here is an example for a single row. The fetched data is a list with one tuple, and values is the built dictionary.
# data
[(1, 6, None, None, 204, 1, True, False, 204, 1.0, 1.0, 1.0, False, None)]
# values
[{'DateDeleted': None, 'sentidocirculacao': False, 'TempoPercursoMed': 1.0,
'ExtensaoTroco': 204, 'OBJECTID': 229119, 'NumViasSentido': 1,
'Deleted': False, 'TempoPercursoMin': 1.0, 'IdCentroOp': 6,
'IDParagemInicio': None, 'IDParagemFim': None, 'TipoPavimento': True,
'TempoPercursoMax': 1.0, 'IDTroco': 1, 'CorredorBusext': 204}]
Note that Python dictionaries are not ordered, which is why the values in the two lists are not in the same positions. The geometry column was removed from this example for simplicity.
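In code, one such dictionary per row can be built by zipping the column names with each fetched tuple (here keys is assumed to hold the column names, in the same order as the fetched values); the geometry fix-up described next is then applied on top of this, as shown in section 4.4:
values = [dict(zip(keys, row)) for row in data]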
4.2 Fixing geometries
Probably, the previous workaround would not be required if this issue had not occurred: sometimes geometries are stored/retrieved with the wrong type.
In MSSQL/ArcSDE, the geometry data type does not specify which type of geometry is being stored (i.e., line, polygon, etc.); it only cares that it is a geometry. This information is stored in another (system) table, called SDE_geometry_columns (see the bottom of that page). However, Postgres (PostGIS, actually) requires the geometry type when defining a geometric column.
This leads to spatial data being stored with the wrong geometry type. By wrong I mean that it is different from what it should be. For instance, looking at the SDE_geometry_columns table (excerpt):

f_table_name    geometry_type
TableName       9

geometry_type = 9 corresponds to ST_MULTILINESTRING. However, there are rows in the TableName table which are stored (or retrieved) as ST_LINESTRING. This mismatch raises an error on the Postgres side.
As a workaround, you can edit the WKT while creating the aforementioned dictionaries. For example, 'LINESTRING (10 12, 20 22)' is transformed into 'MULTILINESTRING ((10 12, 20 22))'.
4.3 Missing SRID
Finally, if you are willing to keep the SRIDs, you also need to define them when creating the geometric columns.
If there is an SRID defined in the table model, it has to be satisfied when inserting data into Postgres. The problem is that when fetching geometry data as WKT with the .STAsText() method, you lose the SRID information.
Luckily, PostGIS supports an extended WKT (EWKT) format that includes the SRID.
The solution here is to include the SRID while fixing the geometries. With the same example, 'LINESTRING (10 12, 20 22)' is transformed into 'SRID=102165;MULTILINESTRING ((10 12, 20 22))'.
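Putting 4.2 and 4.3 together, the geometry fix can be done in a small helper. The original implementation is not shown, so the fix() below is only a sketch; it handles just the LINESTRING to MULTILINESTRING case described above:
def fix(wkt, srid):
    # Promote single-part linestrings to multilinestrings and prepend the SRID (EWKT).
    if wkt is None:
        return None
    wkt = wkt.strip()
    if wkt.startswith('LINESTRING'):
        wkt = 'MULTILINESTRING ({0})'.format(wkt[len('LINESTRING'):].strip())
    return 'SRID={0};{1}'.format(srid, wkt)

# fix('LINESTRING (10 12, 20 22)', 102165)
# -> 'SRID=102165;MULTILINESTRING ((10 12, 20 22))'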
4.4 Fetch and insert
Once everything is fixed, you are ready to insert. As mentioned before, only now will the data actually be retrieved from the source. You can do this in chunks (a user-defined amount) of data, for instance, 1000 rows at a time.
# 'data' is the ResultProxy returned by the source query; 'keys' holds the
# column names in the same order, and 'target_table' is the matching target table.
while True:
    rows = data.fetchmany(1000)
    if not rows:
        break
    values = [{key: (val if key.lower() != "shape" else fix(val, 102165))
               for key, val in zip(keys, row)} for row in rows]
    postgres_engine.execute(target_table.insert(), values)
Here fix() is the function that corrects the geometries and prepends the given SRID to the geometric column (identified, in this example, by the column name "shape"), as described above, and values is the aforementioned list of dictionaries.
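For completeness, a possible outer loop over the tables that exist on both sides could look like this (just a sketch, reusing the hypothetical build_select() helper from step 3 and assuming the names match, as required earlier):
# Iterate only over tables whose names exist both in the reflected source
# metadata and in the declared target models.
common_tables = set(metadata.tables) & set(Base.metadata.tables)
for name in common_tables:
    src_table = metadata.tables[name]
    target_table = Base.metadata.tables[name]
    keys = [col.name for col in src_table.c]
    data = ms_sql.execute(build_select(src_table))
    # ...then run the chunked fetch/insert loop shown above.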
Result
The result is a copy of the schema and data from an MS SQL Server + ArcSDE database into a PostgreSQL + PostGIS database.
Here are some stats, from my use case, for performance analysis. Both databases are in the same machine; the code was executed from a different machine, but in the same local network.
Tables  | Geometry Column | Rows    | Fixed Geometries | Insert Time
--------|-----------------|---------|------------------|------------
Table 1 | MULTILINESTRING | 1114797 | 702              | 17min12s
Table 2 | None            | 460874  | ---              | 4min55s
Table 3 | MULTILINESTRING | 389485  | 389485           | 4min20s
Table 4 | MULTIPOLYGON    | 4050    | 3993             | 34s
Total   |                 | 3777964 | 871243           | 48min27s