I have a .txt file inside a tar.gz archive in Azure Blob Storage. Is there any way to read the contents of the .txt file in Azure Databricks without extracting the tar.gz archive?
- Hi, check this thread to see if it is of any help: https://stackoverflow.com/questions/70298817/read-gz-files-inside-tar-files-without-extracting – NiharikaMoola-MT Mar 16 '22 at 12:14
- See if this helps: https://stackoverflow.com/questions/60104770/pyspark-load-a-tar-gz-file-into-a-dataframe-and-filter-by-filename – Dipanjan Mallick Mar 16 '22 at 16:59
1 Answer
Thank you, DKNY, for sharing your valuable suggestions. Posting the same as an answer to help other community members.
Use Databricks to perform the required operation:
- Extract the archive into a temporary location using bash commands:
%sh find "$source" -name "*.tar.gz" -exec tar -xvzf {} -C "$destination" \;
- The above command extracts every file with the extension *.tar.gz from the source to the destination location. If the paths are passed via dbutils.widgets, or declared statically in %scala or %python, they must be exported as environment variables so the %sh cell can see them. This can be done in %python, as shown below (a widgets-based variant follows the code):
import os
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'          # folder holding the tar.gz archives
os.environ['destination'] = '/dbfs/mnt/dl/raw/extracted/'  # extraction target used by the %sh cell; path is an example
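If the paths come from dbutils.widgets instead, they can be forwarded to the shell the same way. A minimal sketch, assuming widgets named source_path and destination_path exist in the notebook (dbutils is available in Databricks notebooks without an import):

import os
# Widget names below are assumptions for illustration.
os.environ['source'] = dbutils.widgets.get('source_path')
os.environ['destination'] = dbutils.widgets.get('destination_path')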
- Use the following to load the file, assuming the *.txt file contains delimited content (a plain-text alternative follows below):
# Spark reads the mount path without the /dbfs prefix; point at the extraction target
df = spark.read.format('csv').options(header='true', inferSchema='true').option("mode", "DROPMALFORMED").load('/mnt/dl/raw/extracted/sample.txt')
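If the file is plain unstructured text rather than delimited data, spark.read.text is the simpler fit. A minimal sketch against the same assumed extraction path:

# Each line of the file becomes a row in a single 'value' column
df_text = spark.read.text('/mnt/dl/raw/extracted/sample.txt')
df_text.show(truncate=False)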

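To read the .txt without extracting to disk at all, as the question asks, Python's tarfile module can open the archive straight from the mount. A minimal sketch; the archive and member names are placeholders, and the read happens on the driver only, so it suits small files:

import tarfile

# '/dbfs/...' is the FUSE path to the mounted container; archive.tar.gz
# and sample.txt are placeholder names for illustration.
with tarfile.open('/dbfs/mnt/dl/raw/source/archive.tar.gz', 'r:gz') as tar:
    member = tar.extractfile('sample.txt')   # file-like object for the entry
    content = member.read().decode('utf-8')

print(content)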
Madhuraj Vadde