Create New Rows from Oracle CLOB and Write to HDFS

Question

In an Oracle database, I can read this table containing a CLOB type (note the newlines):

ID         MY_CLOB
001        500,aaa,bbb
           500,ccc,ddd
           480,1,2,bad
           500,eee,fff
002        777,0,0,bad
003        500,yyy,zzz

I need to process this, and import into an HDFS table with new rows for each MY_CLOB line starting with "500,". In this case, the hive table should look like:

ID     C_1    C_2    C_3
001    500    aaa    bbb
001    500    ccc    ddd
001    500    eee    fff
003    500    yyy    zzz

This solution to my previous question succeeds in producing this on Oracle. But writing the result to HDFS with a Python driver is very slow, or never succeeds.

Following this solution, I've tested a similar regex + pyspark solution that might work for my purposes:

<!-- begin snippet: js hide: true -->
import cx_Oracle
#... query = """SELECT ID, MY_CLOB FROM oracle_table"""
#... cx_oracle_results <--- fetchmany results (batches) from query

import re
from pyspark.sql import Row
from pyspark.sql.functions import col
def clob_to_table(clob_lines):
    m = re.findall(r"^(500),(.*),(.*)", 
                   clob_lines, re.MULTILINE)
    return Row(C_1 = m.group(1), C_2 = m.group(2), C_3 = m.group(3))

# Process each batch of results and write to hive as parquet
for batch in cx_oracle_results():
    # batch is like [(1,<cx_oracle object>), (2,<cx_oracle object>), (3,<cx_oracle object>)]
    # When `.read()` looks like [(1,"500,a,b\n500c,d"), (2,"500,e,e"), (3,"500,z,y\n480,-1,-1")]
    df = sc.parallelize(batch).toDF(["ID", "MY_CLOB"])\
           .withColumn("clob_as_text", col("MY_CLOB")\
           .read()\  # Converts cx_oracle CLOB object to text.
           .map(clob_to_table)
    df.write.mode("append").parquet("myschema.pfile")

But reading oracle cursor results and feeding them into pyspark this way doesn't work well.

I'm trying to to run a sqoop job generated by another tool, importing the CLOB as text, and hoping I can process the sqooped table into a new hive table like the above in reasonable time. Perhaps with pyspark with a solution similar to above.

Unfortunately, this sqoop job doesn't work.

sqoop import -Doraoop.timestamp.string=false -Doracle.sessionTimeZone=America/Chicago 
-Doraoop.import.hint=" " -Doraoop.oracle.session.initialization.statements="alter session disable parallel query;" 
-Dkite.hive.tmp.root=/user/hive/kite_tmp/wassadamo --verbose 
--connect jdbc:oracle:thin:@ldap://connection/string/to/oracle 
--num-mappers 8 --split-by date_column 
--query "SELECT * FROM (
    SELECT ID, MY_CLOB
    FROM oracle_table
    WHERE ROWNUM <= 1000
    ) WHERE \$CONDITIONS" 
--create-hive-table --hive-import --hive-overwrite --hive-database my_db 
--hive-table output_table --as-parquetfile --fields-terminated-by \| 
--delete-target-dir --target-dir $HIVE_WAREHOUSE --map-column-java=MY_CLOB=String 
--username wassadamo --password-file /user/wassadamo/.oracle_password

But I get an error (snippet below):

20/07/13 17:04:08 INFO mapreduce.Job:  map 0% reduce 0%
20/07/13 17:05:08 INFO mapreduce.Job: Task Id : attempt_1594629724936_3157_m_000001_0, Status : FAILED
Error: java.io.IOException: SQLException in nextKeyValue
...
Caused by: java.sql.SQLDataException: ORA-01861: literal does not match format string

This seems to have been caused by mapping the CLOB column to string. I did this based on this answer.

How can I fix this? I'm open to a different pyspark solution as well

Did you think of splitting the clob in the database instead of doing it on pyspark ? If you do the hard work in the database, it will go much faster. — Roberto Hernandez, Jul 21 '20 at 08:12
I have no UPDATE/CREATE permissions, only SELECT @RobertoHernandez. I've tried running that first [solution SQL](https://stackoverflow.com/questions/62791453/split-multiline-clob-column-oracle-pl-sql/62791851#62791851) through the Python driver, then writing to local csv, but as I said, it's very slow, or never terminates. — Wassadamo, Jul 21 '20 at 17:47
And the query seems too complex for sqoop. But if so, when and how should I parse the clobs? — Wassadamo, Jul 21 '20 at 17:53
I agree that the query is quite complex for sqoop, but if you don't have an option to create a view is your only chance. I don't think anyway that the ORA-01861 is due to map-column-java=clob:string . That is actually the only way to import clob in Hive. — Roberto Hernandez, Jul 21 '20 at 18:19
If the query is too complex for sqoop, perhaps I can create the view in Hive instead? — Wassadamo, Jul 21 '20 at 19:47
If someone could fix my python snippet, allowing it to be hidden, I'd appreciate that. Can't get the js prefix to work. — Wassadamo, Jul 21 '20 at 20:02

score 0 · Answer 1 · answered Jul 21 '20 at 22:48

Partial answer: the oracle error seems to have been due to

--split-by date_column

This date_column is an Oracle Date type. Turns out it doesn't work when sqooping from Oracle. It would be nice to be able to split on this. But splitting on ID (varchar2) seems to be working.

The issue of performantly parsing the text MY_CLOB field and creating new rows for each line remains.

Create New Rows from Oracle CLOB and Write to HDFS

1 Answers1