
I have a file that has a mix of comma delimited lines and pipe delimited lines I need to import into Databricks.

Is it possible to indicate the use of two or more different separators when creating a sql table in Databricks/Spark?

I see lots of posts about multiple-character separators, but nothing about mixing different separators.

I'm currently using something like this.

create table myschema.mytable (
  foo string,
  bar string
)
using csv
options (
  header = "true",
  delimiter = ","
);
John
  • What is `using cvs`? I think you have two options 1) Filter all lines with each delimiter into their own dataframe, then join them. 2) Use regex to split into rows and create a uniform dataframe – OneCricketeer Sep 28 '20 at 23:02
  • Sample data. Expected results. Please. – wBob Sep 29 '20 at 00:41
  • should read csv, not cvs (corrected in the original post) – John Sep 29 '20 at 01:42
  • I can think of several ways to do it myself (most of them likely to work 99.999% of the time) but I'm hoping this is something built into Databricks/Spark and I don't have to do something custom myself. In other words, I'd like to pass the file I'm provided to my Databricks solution rather than pre-processing it myself. – John Sep 29 '20 at 01:45
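For what it's worth, a minimal sketch of the first approach suggested in the comment above (filter the raw lines into one DataFrame per delimiter, split each with its own separator, then union them) could look like this in PySpark. The path and the foo/bar column names are placeholders taken from the question; a line containing both delimiters would be picked up twice, and headers are not handled here.

from pyspark.sql import functions as F

# Read every line as raw text (hypothetical path).
lines = spark.read.text("/mnt/mydata/mixed_delimiters.txt")

# Route each line to a DataFrame based on which delimiter it contains,
# then split it with that delimiter.
comma_rows = (lines.filter(F.col("value").contains(","))
                   .select(F.split("value", ",").alias("cols")))
pipe_rows = (lines.filter(F.col("value").contains("|"))
                  .select(F.split("value", r"\|").alias("cols")))

# Union the two shapes back into one uniform DataFrame.
df = (comma_rows.union(pipe_rows)
                .select(F.col("cols")[0].alias("foo"),
                        F.col("cols")[1].alias("bar")))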

1 Answer


One method you could try is to create a Spark DataFrame first and then make a table out of it. Below is an example for a hypothetical case, using PySpark, where the delimiters were | and -.

BEWARE: we are using split, which means it will split on everything that matches the pattern, e.g. 2000-12-31 is a single value, yet it will be split. Therefore we should be very sure that no such case ever occurs in the data. As general advice, one should never accept these kinds of files, as they are accidents waiting to happen.
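A quick illustration of that warning, applying the same pattern used in the code below to a made-up value:

import re

# A run of "|" or "-" is treated as one delimiter, so the date inside the
# value gets torn apart along with the real separator.
re.split("[|-]+", "2000-12-31|Mech")   # -> ['2000', '12', '31', 'Mech']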

How the sample data looks: in this case we have 2 files in our directory, with | and - occurring randomly as delimiters.

Sample data pic

# Create RDD. Basically read as simple text file. 
# sc is spark context
rddRead = sc.textFile("/mnt/adls/RI_Validation/ReadMulktipleDelimerFile/Sample1/") 
rddRead.collect() # For debugging

How initial rdd looks

import re # Import for usual python regex 

# Create another rdd using simple string operations. This will be similar to a list of lists.
# Give a regex expression to split your string based on the anticipated delimiters (this could be dangerous
# if those delimiters occur as part of a value, e.g. 2021-12-31 is a single value in reality.
# But this is a price we have to pay for not having good data).
# For each iteration, k represents 1 element which would eventually become 1 row (e.g. A|33-Mech)

rddSplit = rddRead.map(lambda k: re.split("[|-]+", k)) # Anticipated delimiters are | OR - in this case.
rddSplit.collect() # For debugging

Split on delimiters

# This block is applicable only if you have headers
lsHeader = rddSplit.first()  # Get First element from rdd as header.
print(lsHeader) # For debugging
print()
# Remove rows representing the header. (Note: this assumes the column names are the same in
# all files. If not, we would have to filter by manually specifying all of them, which would
# be a nightmare from the point of view of both code quality and maintenance.)
rddData = rddSplit.filter(lambda x: x != lsHeader) 
rddData.collect() # For debugging

Separate Header and Data

# Convert rdd to spark dataframe
# Utilise the header we got in the earlier step. Alternatively, we could supply our own column names.
dfSpark = rddData.toDF(lsHeader)
dfSpark.display() # For debugging

Final Spark df
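To finish the "make a table out of it" step mentioned at the top of the answer, one option (table name borrowed from the question; adjust as needed) is to persist the DataFrame as a table:

# Save the DataFrame as a managed table so it can be queried with SQL.
dfSpark.write.mode("overwrite").saveAsTable("myschema.mytable")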

Nirupam Nishant