
I have a .txt file inside a tar.gz archive in Azure Blob Storage. Is there any way to read the contents of the .txt file in Azure Databricks without extracting the tar.gz archive?

  • Hi, check this thread to see if it is of any help: https://stackoverflow.com/questions/70298817/read-gz-files-inside-tar-files-without-extracting – NiharikaMoola-MT Mar 16 '22 at 12:14
  • See if this helps https://stackoverflow.com/questions/60104770/pyspark-load-a-tar-gz-file-into-a-dataframe-and-filter-by-filename – Dipanjan Mallick Mar 16 '22 at 16:59
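The approach in the linked threads boils down to opening the archive members in memory with Python's tarfile module, so nothing is written out to disk. A minimal sketch, assuming a hypothetical mount path /dbfs/mnt/dl/raw/source/sample.tar.gz (adjust to your own mount point):

import tarfile

archive_path = '/dbfs/mnt/dl/raw/source/sample.tar.gz'  # hypothetical path; adjust to your mount
with tarfile.open(archive_path, 'r:gz') as tar:
    for member in tar.getmembers():
        # Pick out the .txt member(s) and read them without extracting to disk
        if member.isfile() and member.name.endswith('.txt'):
            content = tar.extractfile(member).read().decode('utf-8')
            print(member.name)
            print(content)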

1 Answer


Thank you DKNY for sharing your valuable suggestions. Posting the same as an answer to help other community members.

Use Databricks to perform the required operation:

  1. Unpack the archive into a temporary location using bash commands:
%sh find $source -name "*.tar.gz" -exec tar -xvzf {} -C $destination \;
  2. The above command unpacks every file with the extension *.tar.gz found under the source into the destination location. If the paths are passed via dbutils.widgets or set statically in %scala or %pyspark, they must be declared as environment variables. This can be done in %pyspark:
import os
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
os.environ['destination'] = '/dbfs/mnt/dl/raw/destination/'  # hypothetical destination; set to your own mount
  3. Use the following method to load the file, assuming the *.txt file contains CSV-like content (a plain-text alternative follows this list):
DF = spark.read.format('csv').options(header='true', inferSchema='true').option('mode', 'DROPMALFORMED').load('/mnt/dl/raw/destination/sample.txt')
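If the .txt file is not CSV-shaped, a plain-text read avoids the header and schema assumptions above. A minimal sketch, using the same hypothetical destination path:

# Each line of the file becomes one row in a single string column named 'value'
df_text = spark.read.text('/mnt/dl/raw/destination/sample.txt')
df_text.show(truncate=False)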