I have 2 use cases:
- Extract, Transform and Load from Oracle / PostgreSQL / Redshift / S3 / CSV to my own Redshift cluster
- Schedule the job do it runs daily/weekly (INSERT + TABLE or INSERT + NONE options preferable).
I am currently using:
- SQLAlchemy for extracts (works well generally).
- PETL for transforms and loads (works well on smaller data sets, but for ~50m+ rows it is slow and the connection to the database(s) time out).
- An internal tool for the scheduling component (which stores the transform in XML and then the loads from the XML and seems rather long and complicated).
I have been looking through this link but would welcome additional suggestions. Exporting to Spark or similar is also welcome if there is an "easier" process where I can just do everything through Python (I'm only using Redshift because it seems like the best option).