
I have a .txt file inside a tar.gz archive in Azure Blob Storage. Is there any way to read the contents of the .txt file in Azure Databricks without extracting the tar.gz archive?

  • Hi, check this thread to see if it is of any help: https://stackoverflow.com/questions/70298817/read-gz-files-inside-tar-files-without-extracting – NiharikaMoola-MT Mar 16 '22 at 12:14
  • See if this helps https://stackoverflow.com/questions/60104770/pyspark-load-a-tar-gz-file-into-a-dataframe-and-filter-by-filename – Dipanjan Mallick Mar 16 '22 at 16:59
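The approach in the linked threads boils down to opening the archive members in memory with Python's tarfile module, so nothing is written out to disk. A minimal sketch, assuming a hypothetical mount path /dbfs/mnt/dl/raw/source/sample.tar.gz (adjust to your own mount point):

import tarfile

archive_path = '/dbfs/mnt/dl/raw/source/sample.tar.gz'  # hypothetical path; adjust to your mount
with tarfile.open(archive_path, 'r:gz') as tar:
    for member in tar.getmembers():
        # Pick out the .txt member(s) and read them without extracting to disk
        if member.isfile() and member.name.endswith('.txt'):
            content = tar.extractfile(member).read().decode('utf-8')
            print(member.name)
            print(content)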

1 Answer


Thank you DKNY for sharing your valuable suggestions. Posting the same as an answer to help other community members.

Use Databricks to perform the required operation:

  1. Unpack the archive into a temporary location using bash commands:
%sh find $source -name "*.tar.gz" -exec tar -xvzf {} -C $destination \;
  2. The above command unpacks every file with the extension *.tar.gz found under the source into the destination location. If the paths are passed via dbutils.widgets or set statically in %scala or %pyspark, they must be declared as environment variables. This can be done in %pyspark:
import os
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
os.environ['destination'] = '/dbfs/mnt/dl/raw/destination/'  # hypothetical destination; set to your own mount
  3. Use the following method to load the file, assuming the *.txt file contains CSV-like content (a plain-text alternative follows this list):
DF = spark.read.format('csv').options(header='true', inferSchema='true').option('mode', 'DROPMALFORMED').load('/mnt/dl/raw/destination/sample.txt')
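If the .txt file is not CSV-shaped, a plain-text read avoids the header and schema assumptions above. A minimal sketch, using the same hypothetical destination path:

# Each line of the file becomes one row in a single string column named 'value'
df_text = spark.read.text('/mnt/dl/raw/destination/sample.txt')
df_text.show(truncate=False)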