
I have a file that has a mix of comma delimited lines and pipe delimited lines I need to import into Databricks.

Is it possible to indicate the use of two or more different separators when creating a sql table in Databricks/Spark?

I see lots of posts about multiple-character separators, but nothing about mixing different separators.

I'm currently using something like this.

create table myschema.mytable (
  foo string,
  bar string
)
using csv
options (
  header = "true",
  delimiter = ","
);
John
  • What is `using cvs`? I think you have two options 1) Filter all lines with each delimiter into their own dataframe, then join them. 2) Use regex to split into rows and create a uniform dataframe – OneCricketeer Sep 28 '20 at 23:02
  • Sample data. Expected results. Please. – wBob Sep 29 '20 at 00:41
  • should read csv, not cvs (corrected in the original post) – John Sep 29 '20 at 01:42
  • I can think of several ways to do it myself (most of them likely to work 99.999% of the time) but I'm hoping this is something built into Databricks/Spark and I don't have to do something custom myself. In other words, I'd like to pass the file I'm provided to my Databricks solution rather than pre-processing it myself. – John Sep 29 '20 at 01:45
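For what it's worth, a minimal sketch of the first approach suggested in the comment above (filter the raw lines into one DataFrame per delimiter, split each with its own separator, then union them) could look like this in PySpark. The path and the foo/bar column names are placeholders taken from the question; a line containing both delimiters would be picked up twice, and headers are not handled here.

from pyspark.sql import functions as F

# Read every line as raw text (hypothetical path).
lines = spark.read.text("/mnt/mydata/mixed_delimiters.txt")

# Route each line to a DataFrame based on which delimiter it contains,
# then split it with that delimiter.
comma_rows = (lines.filter(F.col("value").contains(","))
                   .select(F.split("value", ",").alias("cols")))
pipe_rows = (lines.filter(F.col("value").contains("|"))
                  .select(F.split("value", r"\|").alias("cols")))

# Union the two shapes back into one uniform DataFrame.
df = (comma_rows.union(pipe_rows)
                .select(F.col("cols")[0].alias("foo"),
                        F.col("cols")[1].alias("bar")))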

1 Answer


One method you could try is to create a Spark DataFrame first and then make a table out of it. Below is an example for a hypothetical case, using PySpark, where the delimiters were | and -.

BEWARE: we are using split, which means it will split on everything that matches the pattern, e.g. 2000-12-31 is a single value, yet it will be split. Therefore we should be very sure that no such case ever occurs in the data. As general advice, one should never accept these kinds of files, as they are accidents waiting to happen.
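A quick illustration of that warning, applying the same pattern used in the code below to a made-up value:

import re

# A run of "|" or "-" is treated as one delimiter, so the date inside the
# value gets torn apart along with the real separator.
re.split("[|-]+", "2000-12-31|Mech")   # -> ['2000', '12', '31', 'Mech']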

How the sample data looks: in this case we have 2 files in our directory, with | and - occurring randomly as delimiters.

Sample data pic

# Create RDD. Basically read as simple text file. 
# sc is spark context
rddRead = sc.textFile("/mnt/adls/RI_Validation/ReadMulktipleDelimerFile/Sample1/") 
rddRead.collect() # For debugging

How initial rdd looks

import re # Import for usual python regex 

# Create another rdd using simple string operations. This will be similar to a list of lists.
# Give a regex expression to split your string based on the anticipated delimiters (this could be dangerous
# if those delimiters occur as part of a value, e.g. 2021-12-31 is a single value in reality.
# But this is a price we have to pay for not having good data).
# For each iteration, k represents 1 element which would eventually become 1 row (e.g. A|33-Mech)

rddSplit = rddRead.map(lambda k: re.split("[|-]+", k)) # Anticipated delimiters are | OR - in this case.
rddSplit.collect() # For debugging

Split on delimiters

# This block is applicable only if you have headers
lsHeader = rddSplit.first()  # Get First element from rdd as header.
print(lsHeader) # For debugging
print()
# Remove rows representing the header. (Note: this assumes the column names are the same in
# all files. If not, we would have to filter by manually specifying all of them, which would
# be a nightmare from the point of view of both code quality and maintenance.)
rddData = rddSplit.filter(lambda x: x != lsHeader) 
rddData.collect() # For debugging

Separate Header and Data

# Convert rdd to spark dataframe
# Utilise the header we got in the earlier step. Alternatively, we could supply our own column names.
dfSpark = rddData.toDF(lsHeader)
dfSpark.display() # For debugging

Final Spark df
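To finish the "make a table out of it" step mentioned at the top of the answer, one option (table name borrowed from the question; adjust as needed) is to persist the DataFrame as a table:

# Save the DataFrame as a managed table so it can be queried with SQL.
dfSpark.write.mode("overwrite").saveAsTable("myschema.mytable")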

Nirupam Nishant