
We have a use case involving retail industry data, and we are building an EDW.

We currently do our reporting from HAWQ, but we want to shift our MPP database from HAWQ to Greenplum. Basically, we would like to make changes to our current data pipeline.

Our points of confusion about GPDB:

  • How will the GPDB layer affect our existing data pipeline? Our current pipeline is external system --> talend --> hadoop-hawq --> tableau. We want to transform it into external system --> talend --> hadoop-hawq --> greenplum --> tableau.
  • How will Greenplum physically or logically help with SQL transformation and reporting?

  • Which file format should I opt for when storing files in GPDB? In
    HAWQ we store files in plain text format. Which of the supported formats (e.g., Avro, Parquet) are good for writing in GPDB?

  • How are data files processed by GPDB, so that it also brings faster reporting and predictive analysis?

  • Is there any way to push data from HAWQ into Greenplum? We are
    looking for guidance on how to shift our reporting use case from
    HAWQ to Greenplum.

Any help would be much appreciated.

NEO

1 Answer


This question is sort of like asking, "when should I use a wrench?" The answer is also going to be subjective, as Greenplum can be used for many different things. But I will do my best to give my opinion, because you asked.

How will the GPDB layer affect our existing data pipeline? Our current pipeline is external system --> talend --> hadoop-hawq --> tableau. We want to transform it into external system --> talend --> hadoop-hawq --> greenplum --> tableau.

There are lots of ways to build the data pipeline. Your goal of loading data into Hadoop first and then loading it into Greenplum is very common and works well. You can use external tables in Greenplum to read data in parallel, directly from HDFS, so the data movement from the Hadoop cluster to Greenplum can be achieved with a simple INSERT statement.

INSERT INTO greenplum_customer SELECT * FROM hdfs_customer_file;
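The external table that INSERT reads from might be defined along these lines. This is a sketch: the host, port, HDFS path, and column list are all placeholders, and it assumes the gphdfs protocol that ships with GPDB 4.3 for parallel HDFS reads.

```sql
-- Readable external table over a pipe-delimited text file set on HDFS.
-- Host, port, path, and columns below are illustrative placeholders.
CREATE EXTERNAL TABLE hdfs_customer_file (
    customer_id   integer,
    customer_name text,
    region        text
)
LOCATION ('gphdfs://namenode-host:8020/data/retail/customer/*')
FORMAT 'TEXT' (DELIMITER '|');

-- Each Greenplum segment reads its share of the HDFS blocks in parallel.
INSERT INTO greenplum_customer SELECT * FROM hdfs_customer_file;
```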

How will Greenplum physically or logically help with SQL transformation and reporting?

Isolation, for one. With a separate cluster for Greenplum, you can provide analytics to your customers without impacting the performance of your Hadoop activity, and vice versa. This isolation can also provide an additional security layer.

Which file format should I opt for when storing files in GPDB? In HAWQ we store files in plain text format. Which of the supported formats (e.g., Avro, Parquet) are good for writing in GPDB?

With the data pipeline you suggested, I would base the data format decision in Greenplum on performance. For large tables, partition them and make them column-oriented with quicklz compression. For smaller tables, just make them append-optimized. And for tables with lots of updates or deletes, keep the default heap storage.
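As a sketch of those three storage choices (table and column names here are illustrative, not from your schema):

```sql
-- Large fact table: partitioned, column-oriented, quicklz-compressed.
CREATE TABLE sales_fact (
    sale_id   bigint,
    store_id  integer,
    sale_date date,
    amount    numeric(12,2)
)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(START (date '2016-01-01') INCLUSIVE
 END   (date '2017-01-01') EXCLUSIVE
 EVERY (INTERVAL '1 month'));

-- Smaller table: plain append-optimized, row-oriented.
CREATE TABLE store_dim (
    store_id   integer,
    store_name text
)
WITH (appendonly=true)
DISTRIBUTED BY (store_id);

-- Table with frequent updates/deletes: default heap storage.
CREATE TABLE customer_dim (
    customer_id   integer,
    customer_name text
)
DISTRIBUTED BY (customer_id);
```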

How are data files processed by GPDB, so that it also brings faster reporting and predictive analysis?

Greenplum is an MPP database. The storage is "shared nothing" meaning that each node has unique data that no other node has (excluding mirroring for high-availability). A segment's data will always be on the local disk.

In HAWQ, because it uses HDFS, a segment's data doesn't have to be local. On day 1, when you wrote the data to HDFS, it was local, but after node failures, expansion, etc., HAWQ may have to fetch the data from other nodes. This makes Greenplum's performance a bit more predictable than HAWQ's, because of how Hadoop works.

Is there any way to push data from HAWQ into Greenplum? We are looking for guidance on how to shift our reporting use case from HAWQ to Greenplum.

Push, no; but pull, yes. As I mentioned above, you can create an external table in Greenplum to SELECT data from HDFS. You can also create writable external tables in Greenplum to push data to HDFS.
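The writable direction looks much like the readable one. Again a sketch: host, port, and HDFS path are placeholders, and this assumes the greenplum_customer table from the earlier example exists with a customer_id column.

```sql
-- Writable external table: pushes query results out to HDFS as text.
-- Host, port, and path are illustrative placeholders.
CREATE WRITABLE EXTERNAL TABLE hdfs_customer_out (LIKE greenplum_customer)
LOCATION ('gphdfs://namenode-host:8020/data/retail/customer_out')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (customer_id);

-- Each segment writes its slice of the result set to HDFS in parallel.
INSERT INTO hdfs_customer_out SELECT * FROM greenplum_customer;
```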

Jon Roberts
  • Thanks, Jon, for the information. It helps me a lot to understand. So, we have decided to upgrade GPDB. – NEO May 21 '16 at 05:15
  • Upgrade from 4.3.4.1 to 4.3.8.x: should we go for it, or is anything 4.3.x OK? We need your suggestion on it. – NEO May 21 '16 at 05:19
  • 4.3.5 is a significant upgrade and you will need to also upgrade extensions like gptext if you also installed that. Be sure to backup the database first with gpcrondump before upgrading. But upgrading to the latest version will bring you new features and more stability so I always recommend being on the latest version. – Jon Roberts May 22 '16 at 15:16
  • I am upgrading Greenplum from 4.3.4.1 to 4.3.8.2 and getting an error at the gpseginstall utility. Here is the upgrade error - http://paste.ofcode.org/dZu2SMvKh94GA2ApQqb65t – NEO May 22 '16 at 15:40
  • The upgrade process completed; we are now on 4.3.8.2. – NEO May 23 '16 at 12:28
  • While installing GPCC, we got stuck at the last prompt. – NEO May 24 '16 at 11:12
  • Hi Jon, we get some Greenplum questions on dba.se (like [this one](http://dba.stackexchange.com/q/85884/1396)), and I'm thinking it might be nice to have the database as an option on [dbfiddle](http://dbfiddle.uk/?rdbms=postgres_9.6) (which I [created and look after](https://dba.meta.stackexchange.com/questions/2686/a-new-fiddle-for-dba-se)). Pop into [the Heap](http://chat.stackexchange.com/rooms/179/the-heap--consultancy-) sometime if you'd like to chat about the possibility. –  Mar 22 '17 at 16:34