
I would like to (re)start with GemStone/S. I have done multiple ETL transformations for relational databases, but I'm still fuzzy on how this is done in GemStone/S.

I would like to load data into GemStone from different sources. It could be files (csv, excel, xml, plain text, etc.) or other DBs like SQL Server, Postgres, Oracle, etc.

From what I saw on the pages, there is GemConnect, which connects to Oracle databases. How do you do it from other databases or files? Is there any option to connect via ODBC? Is there any data pump to do so, or do you "just" have to write one yourself?

In the end, what I'm asking is how to create a staging area, where you would clean up, transform, and then load the data into a GemStone DB. Are there any examples or documentation on how this is done?

Note: The only similar answer I have found is on SO, from Stephan Eggermont, but it was short and without any "real" information.

tukan

1 Answer


Staging

I suspect that the reason that most environments have "ETL/staging" as a separate step is because the two endpoints are somewhat rigid and don't have a good programming language for data manipulation. That is, if you have TXT, CSV, XML, JSON, or SQL, and need it in another format/schema, then someone has to do the "transformation." But if you are working in GemStone, then you can do the transformation in Smalltalk--there is no need for a separate step.

Files

If you have files (TXT, CSV, XML, JSON, etc.), then use GsFile. In fact, if the other endpoint can deal with files, then just export from one side in an agreed format and import on the other (with GemStone doing the "heavy lifting" of transforming). Files are simpler, they avoid the communications layer, and they make debugging trivial: if the source hasn't created the file, then it is the source's problem; if it is in the pending directory, then the destination hasn't processed it yet (destination problem); if it is in the completed directory, then the destination has processed it.

With this approach you start (one or more) background jobs in GemStone to watch a directory, open a file for read, process the file, and then move it to another directory. Other than basic string manipulation, you only need to work with GsFile. Then you create and update your objects in the database.
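A minimal sketch of such a watcher job, assuming server-side directories `/data/pending/` and `/data/completed/` (the paths and the per-line parsing step are illustrative; `GsFile` and `System` are the real GemStone classes):

```smalltalk
"Hedged sketch of a directory-watching import loop.
 Paths and the parse step are assumptions for illustration."
| pending completed files |
pending := '/data/pending/'.
completed := '/data/completed/'.
"list the files waiting to be processed (server-side directory)"
files := GsFile contentsOfDirectory: pending onClient: false.
files do: [:path |
    | file line |
    file := GsFile openReadOnServer: path.
    [file atEnd] whileFalse: [
        line := file nextLine.
        "parse the line and create/update domain objects here"
        ].
    file close.
    System commitTransaction.
    "move the processed file to the completed directory via the shell"
    System performOnServer: 'mv ', path, ' ', completed].
```

In practice you would run this in a loop in a dedicated background session (a "service gem"), sleeping between scans, and handle commit failures and malformed files.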

ODBC

While it would be possible to make FFI calls from GemStone to an ODBC library (or to a database's native library as is done with GemConnect), this would probably be unnecessarily complex. Instead, I'd create another layer using tools that have better interaction with the foreign system. This layer could write text files (as described above), or, with the proper interface, could communicate with GemStone directly. My inclination would be to use Dolphin to extract the data (good ODBC support), then communicate directly to GemStone from Dolphin. You could do something similar with other client Smalltalk dialects (Pharo, VA, or VW), or even from another language (I have a student working on a Python interface to GemStone).

O/R Mapping

Here again you are left with needing a way to take data in one format and translate it to another. These tend to be highly domain-specific and we find it easier to just write Smalltalk code. Alternatively, you could use something like GLORP in Pharo, VA, VW, etc.
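As a concrete (and entirely illustrative) example of such domain-specific mapping code, a hand-written translation from one delimited row to a hypothetical `Customer` object might look like this (`Customer`, its accessors, and `AllCustomers` are assumed domain code, not GemStone classes; `subStrings:` is the ANSI split protocol):

```smalltalk
"Illustrative only: Customer, name:, email:, and AllCustomers are
 assumed application code."
| fields customer |
fields := row subStrings: ','.        "split the delimited row on commas"
customer := Customer new.
customer name: (fields at: 1).
customer email: (fields at: 2).
AllCustomers add: customer.
```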

Best Practices

I think you haven't found any "best practices" for ETL in GemStone because it isn't something we think of as an external process or separate step. There is just how to communicate with a file (GsFile), a socket (GsSocket), a library (CLibrary), or a client (GCI). From here we can look at internal processing issues like multiple producers and one consumer (RcQueue), or one producer and multiple consumers (locking).
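A hedged sketch of the multiple-producers/one-consumer pattern with `RcQueue` (the reduced-conflict queue; the work-item handling is illustrative):

```smalltalk
"One-time setup: a shared queue visible to all sessions"
UserGlobals at: #ImportQueue put: RcQueue new.

"In each producer session: append work without write/write conflicts"
(UserGlobals at: #ImportQueue) add: aWorkItem.
System commitTransaction.

"In the single consumer session: drain the queue"
| queue item |
queue := UserGlobals at: #ImportQueue.
[queue isEmpty] whileFalse: [
    item := queue removeFirst.
    "transform item and update domain objects here"
    ].
System commitTransaction.
```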

So, it isn't that GemStone applications don't do ETL; they just do it internally, and the details are much more situation-specific.

James Foster
  • Thank you for your ideas, it is highly appreciated! It should be both on-going and one-shot. The on-going case is usually, but not always, some external system like SAP, CRM, DWH, etc. (in rare scenarios it can also be files, if the two systems can't be interconnected (interfaced)). Until now this was done as *source system* => *ETL* => *destination DB*. The source is one of the named systems, and the ETL was usually multi-step: from SAP BW, which was pushing the data (via a JDBC connector) to the interface part of the DB (usually SQL Server), which had SSIS package(s) running on the SQL Server – tukan May 03 '19 at 08:14
  • The SSIS package was doing all the transformation needed and storing it into the DB. My question is about replacing the SSIS/SQL Server part. The one-shot case is the exact opposite: it is usually files, the format is negotiated before the load, but different providers have different file types, thus my question about different format(s). It could also be the other way around (GemStone being the source). In the end it is about ORM, where you have the relational world on one side and objects on the other. I would like to use Smalltalk/X as the client (I love the Mercurial VCS support there) – tukan May 03 '19 at 08:31
  • "staging area" in GemStone - I'm for it :). My question was how it was done in the past; are there any best practices for how this was/is done? I found virtually no examples/documentation on this topic. – tukan May 03 '19 at 08:35
  • I've updated the original answer based on your comments. – James Foster May 03 '19 at 14:11
  • *I suspect that the reason that most environments have "ETL/staging" as a separate step is because the two endpoints are somewhat rigid* That is very true - usually you have some graphical environment with some scripting possibilities (like SSIS, where you can "see" the data flowing when the package is executed). Some have even made a huge business from it, like *Informatica* (which is also used, but only for communication with external systems). Thank you for the update! – tukan May 03 '19 at 14:21