
Greenplum Database version: PostgreSQL 8.2.15 (Greenplum Database 4.2.3.0 build 1)

SQL Server Database version: Microsoft SQL Server 2008 R2 (SP1)

Our current approach:

1) Export each table to a flat file from SQL Server

2) Load the data into Greenplum with the psql.exe utility that ships with pgAdmin III (PSQL Console)


Benefits...

  • Speed: OK, but is there anything faster? We load millions of rows of data in minutes

  • Automation: OK, we call this utility from an SSIS package using a Shell script in VB


Pitfalls...

  • Reliability: ETL is dependent on the file server to hold the flat files

  • Security: Lots of potentially sensitive data on the file server

  • Error handling: It's a problem. psql.exe never raises an error we can catch, even when it errors out and loads nothing or only a partial file (see the ON_ERROR_STOP note after the .sql file below)


What else we have tried...

.Net Providers\Odbc Data Provider: We have configured a System DSN using DataDirect 6.0 Greenplum Wire Protocol. Good performance for a DELETE. Dog-awful slow for an INSERT.
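One thing that may help the ODBC insert path (untested here, and discussed further in the link in the comments below): the usual bottleneck is one network round trip per row, so batching many rows into a single multi-row INSERT, which PostgreSQL supports as of 8.2, can cut that overhead substantially. A minimal sketch with made-up columns:

-- Hypothetical table and columns, for illustration only
INSERT INTO MyTable (id, label)
VALUES
    (1, 'a'),
    (2, 'b'),
    (3, 'c');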

For reference, this is the aforementioned VB script in SSIS...

Public Sub Main()

    Dim v_shell As Integer
    Dim v_psql As String

    ' Embedded double quotes in a VB string literal must be doubled (""),
    ' and the psql.exe path needs quoting because of the space in "Program Files"
    v_psql = """C:\Program Files\pgAdmin III\1.10\psql.exe"" -d MyGPDatabase -h MyGPHost -p 5432 -U MyServiceAccount -f \\MyFileLocation\SSIS_load\sql_files\load_MyTable.sql"

    ' Wait:=True blocks until psql.exe exits; Shell returns the process ID, not the exit code
    v_shell = Shell(v_psql, AppWinStyle.NormalFocus, True)

End Sub
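
Since Shell never surfaces psql's exit status, one workaround (a sketch, not tested against this environment) is to launch psql through System.Diagnostics.Process in the SSIS script task, pass -v ON_ERROR_STOP=1, and check the exit code:

Public Sub Main()

    ' Sketch: Process exposes the exit code that Shell() hides
    Dim p As New System.Diagnostics.Process()
    p.StartInfo.FileName = "C:\Program Files\pgAdmin III\1.10\psql.exe"
    p.StartInfo.Arguments = "-v ON_ERROR_STOP=1 -d MyGPDatabase -h MyGPHost -p 5432 " & _
        "-U MyServiceAccount -f \\MyFileLocation\SSIS_load\sql_files\load_MyTable.sql"
    p.StartInfo.UseShellExecute = False
    p.Start()
    p.WaitForExit()

    ' psql exits 0 on success; with ON_ERROR_STOP set it should exit non-zero
    ' when the script fails, which the package can then treat as a task failure
    If p.ExitCode = 0 Then
        Dts.TaskResult = ScriptResults.Success
    Else
        Dts.TaskResult = ScriptResults.Failure
    End If

End Sub

(ScriptResults here is the enum the SSIS 2008 script-task template generates; adjust if your template differs.)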


This is the contents of the "load_MyTable.sql" file...

\copy MyTable from '\\MyFileLocation\SSIS_load\txt_files\MyTable.txt' with delimiter as ';' csv header quote as '"'
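
On the error-handling pitfall: psql has an ON_ERROR_STOP variable; when it is set, a failing script should make psql exit with a non-zero status instead of carrying on silently. Whether it traps a failed \copy in this psql version is worth verifying, but the change is one line at the top of the file:

\set ON_ERROR_STOP on
\copy MyTable from '\\MyFileLocation\SSIS_load\txt_files\MyTable.txt' with delimiter as ';' csv header quote as '"'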
Jon Jaussi
  • [Greenplum](http://www.greenplum.com/products) what? To reproduce your problem, what product will I install? – billinkc Apr 17 '13 at 02:48
  • Thanks for your comment. PostgreSQL 8.2.15 (Greenplum Database 4.2.3.0 build 1). I added database version info to the top of the question. – Jon Jaussi Apr 17 '13 at 14:07
  • Just a comment.... you can call this command line directly in a .CMD (or .BAT) file: `C:\Program Files\pgAdmin III\1.10\psql.exe -d "MyGPDatabase" -h "MyGPHost" -p "5432" -U "MyServiceAccount" -f \\MyFileLocation\SSIS_load\sql_files\load_MyTable.sql`. If you want to do anything sophisticated like replace/dynamic parameters then sure, put it in VBScript, but it's not necessary. You can even put this in your SSIS package or as a job step to automate the whole thing – Nick.Mc Jan 29 '15 at 11:51
  • ... and there's something about speeding up ODBC inserts at the bottom of this: http://stackoverflow.com/questions/12206600/how-to-speed-up-insertion-performance-in-postgresql. I know these are just links, but I'm not posting an answer here – Nick.Mc Jan 29 '15 at 12:03

1 Answer


If you're getting your data load done in minutes, then the current method is probably good enough. However, if you find yourself having to load larger volumes of data (terabyte scale, for instance), the usual preferred method for bulk-loading into Greenplum is via gpfdist and corresponding EXTERNAL TABLE definitions. gpload is a decent wrapper that abstracts much of this process and is driven by YAML control files. The general idea is that gpfdist instance(s) are spun up at the location(s) where your data is staged, preferably as CSV text files, and then the EXTERNAL TABLE definition within Greenplum is made aware of the URIs of the gpfdist instances. From the admin guide, a sample definition of such an external table could look like this:

CREATE READABLE EXTERNAL TABLE students (
    name varchar(20), address varchar(30), age int)
LOCATION ('gpfdist://<host>:<portNum>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in,
    name=20, address=30, age=4,
    preserve_blanks='on', null='NULL');
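
For context, a gpfdist instance serving the staged files would be started on the file host first; a minimal sketch (directory, port, and log path are placeholders, not from the original answer):

gpfdist -d /path/to/staged_files -p 8081 -l /tmp/gpfdist.log &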

The external table definition above expects to read fixed-width text files whose fields, from left to right, are a 20-character (at most) string, a 30-character string, and an integer. To actually load this data into a staging table inside GP:

CREATE TABLE staging_table AS SELECT * FROM students;

For large volumes of data, this should be the most efficient method, since all segment hosts participate in the parallel load. Do keep in mind that the simplistic approach above will probably result in a randomly distributed table, which may not be desirable; you'd have to customize your table definition to specify a distribution key, for example as sketched below.
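
A minimal sketch of that, assuming (purely for illustration) that name is a sensible distribution key:

CREATE TABLE staging_table AS
SELECT * FROM students
DISTRIBUTED BY (name);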

rchang