23

I'm looking for an embeddable Java ETL, i.e., an Extract Transform Load engine that can be called from Java code.

I'm finding it surprisingly hard to find a suitable one.

I'm mainly looking at loading delimited text files into database tables, with some minor transforms along the way.

I'd like the following features:

  • the ability to specify the simple mappings externally, e.g, text column 5 to database column foo, specified some xml mapping file
  • the ability to give the the database node a javax.sql.Datasource

CloverETL allows mapping to be specified in XML, but database connections must be either JNDI names or a properties file specifying driverClass, url, dbusername, password, etc. Since I already have javax.sql.Datasources set up by my dependency injection framework, properties files seem painful and non-robust, especially if I want this to work in several environments (dev, test, prod).

KETL tells me that "We are currently in the process of completely overhauling our documentation for KETL™. Because of this, only the installation guide has been updated." Honest, but not helpful.

Octopus is now "http://www.together.at/prod/database/tdt", which is "under construction".

Pentaho seems to use the same "specify driverClass" style that CloverETL does, rather that using a datasource, but Pentaho's documentation for calling their engine from java code is just difficult to find.

Basically I'd really like to be able to do this pseudo-code:

extractTransformLoad(         
        getInputFile( "input.csv" ) , 
        getXMLMapping( "myMappingFile.xml") ,
        new DatabaseWriter( getDatasource() );

Any suggestions?

tpdi
  • 34,554
  • 11
  • 80
  • 120
  • I'd think that there would be many other requirements of far higher priority than the ease of rigging it up - at least, given the (IMHO) minor issues you've raised with CloverETL. As far as doc problems getting started... it's often a problem for this sort of middleware, especially when there's a commercial version. Go to whatever forums are available, search, and post a query (there or here) if you can't find anything. Also, be sure to check out examples and sometimes even the project test code. – Ed Staub Nov 07 '12 at 15:12

4 Answers4

15

Disclosure: I'm the author of Scriptella ETL, but I believe this tool might be useful for your case.

It's a lightweight open source ETL with a one-liner integration with Java. It also supports Spring Framework and comes with built-in drivers for CSV, text, XML, Excel and other data-sources.

Example of importing a CSV file into a table:

<!DOCTYPE etl SYSTEM "http://scriptella.org/dtd/etl.dtd">
<etl>
  <connection id="in" driver="csv" url="data.csv" />
  <connection id="out" driver="oracle" url="jdbc:oracle:thin:@localhost:1521:ORCL" 
      classpath="ojdbc14.jar" user="scott" password="tiger" />
  <!-- Copy all CSV rows to a database table -->
  <query connection-id="in">
      <!-- Empty query means select all columns -->
      <script connection-id="out">
          INSERT INTO Table_Name VALUES (?id,?priority, ?summary, ?status)
      </script>
  </query>
</etl>

Running from Java:

// Execute etl.xml file
EtlExecutor.newExecutor(new File("etl.xml")).execute();

Running from command-line:

scriptella [file_name]

Integration with Spring:

  1. Use "spring" driver and the name of the bean to references data-sources. Example:

    <connection id="spring" driver="spring" url="datasourceBeanName" />
    
  2. Add EtlExecutorBean to the application context in order to execute the job:

    <bean id="createDb" class="scriptella.driver.spring.EtlExecutorBean">
        <property name="configLocation" value="create-db.etl.xml" />
        <property name="progressIndicator"><ref local="progress" /></property>
        <property name="autostart" value="true" /> <!-- Etl will be run during app context initialization -->
    </bean>
    

For additional details see the Spring example.

ejboy
  • 4,081
  • 5
  • 30
  • 32
  • 2
    Dare to explain the downvote? Do you have any other examples with such a simple integration with Java code? – ejboy Nov 07 '12 at 14:47
  • It seems you're marketing your own software, I'd *guess* that's the reason for the downvote. – hd1 Feb 12 '13 at 16:49
  • Could you please add an example howto use propertie files to handle several environments (aka. dev.properties, test.properties, prod.properties), especially to branch on different connection strings, users and password. Thanks! – user2161065 Sep 24 '13 at 08:07
4

Do you know Talend?

It's a tool based on Eclipse (Talend Open Studio), but you can use it directly in Java by writing your own code or by exporting jobs to Java classes.

Loïc Guillois
  • 373
  • 1
  • 2
  • 10
1

Here is a list of all the java based open source ETL libraries. I see you have evaluated few of them already but there are more. Also this seems to be a duplicate of https://stackoverflow.com/questions/272517/please-recommend-a-powerful-java-based-etl-framework

Community
  • 1
  • 1
Aravind Yarram
  • 78,777
  • 46
  • 231
  • 327
  • Yes, I've gone down the that list, and looked at the earlier question (from two years ago). I still haven't found my answer, so either I missed something (please point it out) or the answer's not there. In fact, I did all that two weeks ago, before asking this question. – tpdi Nov 23 '10 at 14:31
0

CloverETL Engine is easily embeddable as well as extendible, so you can write your own connection and plug it in to CLoverETL. The DBConnection object will be slightly changed in CloverETL 3.1, to be more extendible and the implementation of its descendant, that uses DataSource for connection to database will be as a child's play.

Agad
  • 26
  • 1