QueryDatabaseTable Nifi Processor fetches duplicate rows from mysql database

Question

QueryDatabaseTable is fetching rows from Mysql table twice on a 2 node cluster. When Execution setting is configured to all nodes, the fetching process itself is not distributed. Each node fetch similar data which isn't the ideal output I need. However changing it to primary node, it works fine, but a single node is burdened the whole process of fetching data which defeats the point of distributed computing. Is there a workaround to this ?

score 1 · Accepted Answer · answered Jan 17 '18 at 13:32

QueryDatabaseTable is designed to run on the Primary Node only with one task only, it does a single fetch and is not a distributed solution. For parallel/distributed fetching in a cluster, you will want the following:

GenerateTableFetch -> RemoteProcessGroup -> Input Port -> ExecuteSQL

GenerateTableFetch should be set to execute on the Primary Node only with one task. It does the "first half" of what QueryDatabaseTable does, by generating SQL statements to grab batches of rows of a specified size. But it does not do the fetch.

Instead you send the SQL statements to a RemoteProcessGroup (RPG) which points to an Input Port on the same cluster. This will distribute the SQL statements among the nodes in the NiFi cluster.

The ExecuteSQL processor on each node will get a subset of the SQL statements to execute, thereby performing the fetch in parallel across the cluster. Note that the remainder of this flow will execute in parallel, you won't be able to join up the results later, but it sounds like that's not what you want anyway.

score 0 · Answer 2 · answered Jan 17 '18 at 07:20

0

rdbms (mysql) does not belong to distributed computing.

so, better to ingest data from such sources on primary node only.

however after ingestion you can split the data and distribute it across the whole nifi cluster.

see this article https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

answered Jan 17 '18 at 07:20

daggett

26,404
3
40
56

When I have huge amount of data in RDBMS say in TB's, using a single node won't help. Any way to get such huge data to NIFI in parallel? – Vishal Kulkarni Jan 17 '18 at 09:56
you can try to split data logically. for example if you have `date` field - you can select odd days on one node, others on second node. but i don't think it will be fast enough. as a variant: you can [export data from mysql to cvs-like file with native mysql tools](https://stackoverflow.com/questions/12040816/mysqldump-in-csv-format), and then work in nifi with file. this should be faster then receiving data through sql query. – daggett Jan 17 '18 at 12:19

QueryDatabaseTable Nifi Processor fetches duplicate rows from mysql database

2 Answers2