
I Sqooped a table from Oracle that had a column with a CLOB data type; I mapped it to String so the data would land in HDFS. Now I have to dismantle the CLOB data and create a separate Hive table from it.

I have the HDFS file in txt format. I can segregate the CLOB data and am hoping to build a DataFrame from it.

The CLOB is in the following format:

[name] Bob [Age] 21 [City] London [work] No,
[name] Steve [Age] 51 [City] London [work] Yes,
.....
There are around a million rows like this.
from pyspark.sql.types import IntegerType, TimestampType

sc.setLogLevel("WARN")
log_txt = sc.textFile("/path/to/data/sample_data.txt")

# The first line is the header; filter it out of the data
header = log_txt.first()
log_txt = log_txt.filter(lambda line: line != header)
log_txt.take(10)
  [u'0\tdog\t20160906182001\tgoogle.com', u'1\tcat\t20151231120504\tgoogle.com']

# Split each tab-separated line and name the columns after the header fields
temp_var = log_txt.map(lambda k: k.split("\t"))
log_df = temp_var.toDF(header.split("\t"))

# Cast the integer and timestamp columns to proper types
log_df = log_df.withColumn("field1Int", log_df["field1"].cast(IntegerType()))
log_df = log_df.withColumn("field3TimeStamp", log_df["field3"].cast(TimestampType()))

log_df.schema
StructType(List(StructField(field1,StringType,true),StructField(field2,StringType,true),StructField(field3,StringType,true),StructField(field4,StringType,true),StructField(field1Int,IntegerType,true),StructField(field3TimeStamp,TimestampType,true)))

This is how I have created the DataFrame.

I need your help to figure out how to dismantle the CLOB, which is now a String, and create a table on top of it.

After dismantling, I expect the table to have the following columns:

+---------+---------------+----------+-----+
|Name     |Age            | City     | Work|
+---------+---------------+----------+-----+
|      Bob|           21  |London    | No  |
|    Steve|           51  |London    |Yes  |
+---------+---------------+----------+-----+

Any help would be appreciated.


1 Answer


Here it is:

import re
from pyspark.sql import Row

# Sample rows standing in for the CLOB strings (the last one has no Age value)
rdd = sc.parallelize(["[name] Bob [Age] 21 [City] London [work] No",
                      "[name] Steve [Age] 51 [City] London [work] Yes",
                      "[name] Steve [Age] [City] London [work] Yes"])

# Capture the text between each [label] and turn it into a Row
def clob_to_table(line):
    m = re.search(r"\[name\](.*)?\[Age\](.*)?\[City\](.*)?\[work\](.*)?", line)
    return Row(name=m.group(1), age=m.group(2), city=m.group(3), work=m.group(4))

rdd = rdd.map(clob_to_table)

df = spark.createDataFrame(rdd)
df.show()

+----+--------+-------+----+
| age|    city|   name|work|
+----+--------+-------+----+
| 21 | London |   Bob |  No|
| 51 | London | Steve | Yes|
|    | London | Steve | Yes|
+----+--------+-------+----+
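
The question also asks for a Hive table on top of the parsed data. Here is a minimal sketch, assuming a Hive-enabled SparkSession; the table name clob_parsed is only a placeholder:

# Persist the parsed DataFrame as a Hive table (table name is hypothetical)
df.write.mode("overwrite").saveAsTable("clob_parsed")

# Or expose it just for the current session and query it with SQL
df.createOrReplaceTempView("clob_parsed_view")
spark.sql("SELECT name, age, city, work FROM clob_parsed_view").show()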
  • Thank you very much! This is exactly what I needed. But I have a couple of doubts. First, I see you have wrapped the data in quotes at the start and end, but I have a file with millions of records, so how would you recommend doing that? Second, what if [City] is missing in some instance? I tried that but got an error; I am thinking of marking it as NULL whenever a value is missing. Any suggestions for that too? Thanks again! – sareen Nov 14 '19 at 04:46
  • 1 - That syntax is just so I could load your example without creating a file. If you have a file, it can be loaded with text_file = sc.textFile("hdfs://..."). This assumes your file stores each row on one line; if that is not the format, you should probably do a preprocessing step that splits the file into one row per line. 2 - Depending on how missing values appear, you may have to change the regex: if a city value is missing but the [City] label is still present, returning NULL is fine, but if the label itself is missing you need to change the regex. – aleaugustoplus Nov 14 '19 at 17:55
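
Following the comment above, here is a minimal sketch of pointing the same parsing at the real file and tolerating missing values. The HDFS path is hypothetical, the trailing-comma strip matches the sample format in the question, and unmatched lines or empty captures become NULL:

import re
from pyspark.sql import Row

def clob_to_row(line):
    # Strip a trailing comma (as in the sample data) and parse the labelled fields
    m = re.search(r"\[name\](.*)\[Age\](.*)\[City\](.*)\[work\](.*)", line.rstrip(","))
    if m is None:
        # Line does not match the expected layout: keep it as an all-NULL row
        return Row(name=None, age=None, city=None, work=None)
    # strip() turns empty captures (e.g. a missing Age value) into NULL instead of " "
    name, age, city, work = [g.strip() or None for g in m.groups()]
    return Row(name=name, age=age, city=city, work=work)

lines = sc.textFile("hdfs:///path/to/clob_column.txt")   # path is illustrative
df = spark.createDataFrame(lines.map(clob_to_row))
df.show()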