sqoop split by option is giving error while using a derived column in the split by option

Question

I have an Oracle query which is fetching 25 million records, there is no pk or no columns which is distributed properly to make as a split by column. So I have thought of making a sequence number using ROW_number() over () as RANGEGROUP. But when I use this pseudo column its giving me an error saying

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164) Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: "P"."RANGEGROUP": invalid identifier at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:91).

I am properly giving the alias, even I tried with out alias to the pseudo column, its still giving the same error. Can we use derived columns in Sqoop split by, or the column should be physically present in table?

Read this related answer:https://stackoverflow.com/a/37389134/2700344 — leftjoin, Jul 21 '20 at 19:20
You should use deterministic non-aggregation, not analytic function. ROW_number() over () is analytic and non-deterministic (because there is no order by clause). — leftjoin, Jul 21 '20 at 19:22
(row_number() OVER (order by p.item_id ) AS RANGEGROUP ,I am using this with order by clause . My issue is this query is runing file in ORACLE.where as this derived column using in Split-by of sqoop is failing .Can we use derived columns in split by — anu john, Jul 22 '20 at 04:05
Sure you can use derived columns. Theproblem seems not with sqoop here. It looks like JDBC driver exception. Maybe the feature is not supported by driver. Try also to wrap in the subquery `select col1, ... colN, RANGEGROUP from (select t.*, row_number ... from table t ) s ` and use derived column in the sqoop. Execute sqoop in verbose mode and you will see queries it executes. You will understand better what the problem is. — leftjoin, Jul 22 '20 at 07:00

leftjoin · Answer 1 · 2021-06-11T09:14:03.907

Use subquery to wrap row_number calculation, then use derived column in the split-by.

   --query "select col1, ... colN, RANGEGROUP 
               from (select t.*, row_number() OVER (order by t.item_id ) AS RANGEGROUP
                      from table t ) s 
              where 1=1 and \$CONDITIONS"

row_number should be deterministic, it means when executed multiple times, it should assign exactly the same number to all rows. What can happen if ORDER BY in the OVER contains not unique column or combination: row_number can return different numbers for the same rows. And if you are using it in the split-by, you will get duplication because the same row can be in split range 1, say 1-100, in mapper2 sqoop will execute same query with filter for range 2, say (101-200) the same row can appear also in that range. Sqoop runs the same query in different containers(mappers) with different condition to get split ranges in parallel.

If Id is int (and much better if it is evenly distributed), use that ID. Why you may need row_number is when it is STRING column. read this: https://stackoverflow.com/a/37389134/2700344, split-column is not necessarily a PK

sqoop split by option is giving error while using a derived column in the split by option

1 Answers1

Linked