In an Oracle database, I can read this table containing a CLOB type (note the newlines):
ID MY_CLOB
001 500,aaa,bbb
500,ccc,ddd
480,1,2,bad
500,eee,fff
002 777,0,0,bad
003 500,yyy,zzz
I need to process this, and import into an HDFS table with new rows for each MY_CLOB line starting with "500,". In this case, the hive table should look like:
ID C_1 C_2 C_3
001 500 aaa bbb
001 500 ccc ddd
001 500 eee fff
003 500 yyy zzz
This solution to my previous question succeeds in producing this on Oracle. But writing the result to HDFS with a Python driver is very slow, or never succeeds.
Following this solution, I've tested a similar regex + pyspark solution that might work for my purposes:
<!-- begin snippet: js hide: true -->
import cx_Oracle
#... query = """SELECT ID, MY_CLOB FROM oracle_table"""
#... cx_oracle_results <--- fetchmany results (batches) from query
import re
from pyspark.sql import Row
from pyspark.sql.functions import col
def clob_to_table(clob_lines):
m = re.findall(r"^(500),(.*),(.*)",
clob_lines, re.MULTILINE)
return Row(C_1 = m.group(1), C_2 = m.group(2), C_3 = m.group(3))
# Process each batch of results and write to hive as parquet
for batch in cx_oracle_results():
# batch is like [(1,<cx_oracle object>), (2,<cx_oracle object>), (3,<cx_oracle object>)]
# When `.read()` looks like [(1,"500,a,b\n500c,d"), (2,"500,e,e"), (3,"500,z,y\n480,-1,-1")]
df = sc.parallelize(batch).toDF(["ID", "MY_CLOB"])\
.withColumn("clob_as_text", col("MY_CLOB")\
.read()\ # Converts cx_oracle CLOB object to text.
.map(clob_to_table)
df.write.mode("append").parquet("myschema.pfile")
But reading oracle cursor results and feeding them into pyspark this way doesn't work well.
I'm trying to to run a sqoop job generated by another tool, importing the CLOB as text, and hoping I can process the sqooped table into a new hive table like the above in reasonable time. Perhaps with pyspark with a solution similar to above.
Unfortunately, this sqoop job doesn't work.
sqoop import -Doraoop.timestamp.string=false -Doracle.sessionTimeZone=America/Chicago
-Doraoop.import.hint=" " -Doraoop.oracle.session.initialization.statements="alter session disable parallel query;"
-Dkite.hive.tmp.root=/user/hive/kite_tmp/wassadamo --verbose
--connect jdbc:oracle:thin:@ldap://connection/string/to/oracle
--num-mappers 8 --split-by date_column
--query "SELECT * FROM (
SELECT ID, MY_CLOB
FROM oracle_table
WHERE ROWNUM <= 1000
) WHERE \$CONDITIONS"
--create-hive-table --hive-import --hive-overwrite --hive-database my_db
--hive-table output_table --as-parquetfile --fields-terminated-by \|
--delete-target-dir --target-dir $HIVE_WAREHOUSE --map-column-java=MY_CLOB=String
--username wassadamo --password-file /user/wassadamo/.oracle_password
But I get an error (snippet below):
20/07/13 17:04:08 INFO mapreduce.Job: map 0% reduce 0%
20/07/13 17:05:08 INFO mapreduce.Job: Task Id : attempt_1594629724936_3157_m_000001_0, Status : FAILED
Error: java.io.IOException: SQLException in nextKeyValue
...
Caused by: java.sql.SQLDataException: ORA-01861: literal does not match format string
This seems to have been caused by mapping the CLOB column to string. I did this based on this answer.
How can I fix this? I'm open to a different pyspark solution as well