
I have 2 use cases:

  • Extract, Transform and Load from Oracle / PostgreSQL / Redshift / S3 / CSV to my own Redshift cluster
  • Schedule the job so it runs daily/weekly (INSERT + TABLE or INSERT + NONE options preferable).

I am currently using:

  1. SQLAlchemy for extracts (works well generally).
  2. PETL for transforms and loads (works well on smaller data sets, but for ~50m+ rows it is slow and the connection to the database(s) times out) – roughly the pattern sketched below.
  3. An internal tool for the scheduling component (it stores the transforms in XML and then loads from the XML, and it feels rather long and complicated).
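
Roughly, the extract → transform → load part looks like the sketch below (a minimal sketch only; the connection strings, table and column names are placeholders, and the Redshift URL assumes the sqlalchemy-redshift dialect is installed):

import petl as etl
from sqlalchemy import create_engine

# Placeholder connection strings -- substitute real hosts/credentials.
source_engine = create_engine("postgresql://user:pass@source-host/source_db")
target_engine = create_engine("redshift+psycopg2://user:pass@redshift-host:5439/dw")

# Extract via SQLAlchemy, transform and load via petl.
table = etl.fromdb(source_engine, "SELECT id, name, created_at FROM source_table")
table = etl.convert(table, "name", lambda v: v.strip())  # example transform
etl.todb(table, target_engine, "target_table")           # this is the slow part at ~50m+ rows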

I have been looking through this link but would welcome additional suggestions. Exporting to Spark or similar is also welcome if there is an "easier" process where I can just do everything through Python (I'm only using Redshift because it seems like the best option).

Dino Abraham

4 Answers


You can try pyetl, an ETL framework written in Python 3:

from pyetl import Task, DatabaseReader, DatabaseWriter

reader = DatabaseReader("sqlite:///db.sqlite3", table_name="source")
writer = DatabaseWriter("sqlite:///db.sqlite3", table_name="target")
# Column name mapping between the source and target tables.
columns = {"id": "uuid", "name": "full_name"}
# Optional transform functions applied per column.
functions = {"id": str, "name": lambda x: x.strip()}
Task(reader, writer, columns=columns, functions=functions).start()
yatao

How about

  • Python
  • Pandas

This is what we use for our ETL processing.
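
A minimal sketch of that approach (Pandas with SQLAlchemy, as elaborated in the comments below); the connection strings, table names and the example transform are placeholders rather than a definitive implementation:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection strings; the Redshift URL assumes the
# sqlalchemy-redshift dialect is installed.
source_engine = create_engine("postgresql://user:pass@source-host/source_db")
target_engine = create_engine("redshift+psycopg2://user:pass@redshift-host:5439/dw")

# Extract into a DataFrame, reshape it in memory, then load it.
df = pd.read_sql("SELECT id, name, created_at FROM source_table", source_engine)
df["name"] = df["name"].str.strip()
df.to_sql("target_table", target_engine, if_exists="append", index=False,
          method="multi", chunksize=10000)

Note that for tens of millions of rows to_sql can be as slow as petl's todb; for Redshift the usual bulk path is staging the data in S3 and using COPY.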

Tim Seed
  • Yup I am using Python and want to continue. :) Interesting you mention Pandas. Do you have a link or tutorial on how Pandas can be used for ETL? I tend to use it for data manipulation only. Do you mean connect using SQLAlchemy (create an engine), then convert the table to a Pandas dataframe and export it into a Redshift table? I liked PETL because it allows you to simply extract / transform / load without needing to create tables, but I'm happy to do so if this is easier. Looking forward to your thoughts. – Dino Abraham Sep 04 '17 at 16:03
  • Exactly as you mentioned. Pandas with SqlAlchemy. No need for temporary tables just manipulate the DataFrame so it is in the format suitable for loading into the table. If you want other products look at Pentaho or IBM's Datastage but these are Java front ends for ETL. – Tim Seed Sep 04 '17 at 16:12
  • Interesting. But how do you actually load the table using SQLAlchemy? How do you set up a daily job? Is it quick for "large" data, as petl is pretty slow? – Dino Abraham Sep 04 '17 at 16:56
  • I misunderstood your last comment above and now understand what you meant. I was more curious about the ETL piece within Python vs using 3rd party tools. Any ideas on this please? – Dino Abraham Sep 12 '17 at 18:06
  • Do you face any limitations using Pandas for 'big data'? I am actually using Spark and am finding it slow for joins of smaller tables (<1 GB), but better for larger tables (>1 GB). Thoughts? Curious also if you have any example Pandas/Python ETL job for me to look at to see how you do this. – Dino Abraham Sep 03 '19 at 17:55

I'm using Pandas for my ETL files; try doing something like this:

  • Create a class that holds all of your queries.
  • Create another class that does the actual data warehouse processing, using Pandas for the data work and Matplotlib for the graphs (a rough sketch follows below).
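
A rough sketch of that two-class layout (the class names, query and engine URL are illustrative placeholders):

import pandas as pd
from sqlalchemy import create_engine

class Queries:
    # Keep all SQL used by the ETL job in one place.
    CUSTOMERS = "SELECT id, name, created_at FROM customers"

class Warehouse:
    # Runs the queries and does the Pandas/Matplotlib work.
    def __init__(self, url):
        self.engine = create_engine(url)

    def load_customers(self):
        df = pd.read_sql(Queries.CUSTOMERS, self.engine)
        df["name"] = df["name"].str.strip()
        return df

    def plot_signups_per_day(self, df):
        # Matplotlib is driven through the pandas plotting API here.
        df.groupby(df["created_at"].dt.date).size().plot(kind="bar")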
Achraf Almouloudi

Consider having a look at the convtools library; it provides lots of data processing primitives, is pure Python and has zero dependencies.

Since it generates ad hoc Python code under the hood, it sometimes outperforms pandas/polars, so it can fill some gaps in your workflows, especially if those have a dynamic nature.
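
A minimal sketch of a simple transform with convtools (the field names and the cleanup logic are placeholders, meant to illustrate the conversion API rather than a full ETL job):

from convtools import conversion as c

# Build a converter that reshapes each input row; convtools compiles it
# into plain Python code when gen_converter() is called.
converter = c.iter(
    {
        "id": c.item("id").as_type(str),
        "full_name": c.item("name").call_method("strip"),
    }
).as_type(list).gen_converter()

rows = [{"id": 1, "name": " Alice "}, {"id": 2, "name": "Bob"}]
print(converter(rows))  # [{'id': '1', 'full_name': 'Alice'}, {'id': '2', 'full_name': 'Bob'}]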

westandskif