9

I am using PySpark. I added a column to capture the filename with its full path:

from pyspark.sql.functions import input_file_name
data = data.withColumn("sourcefile", input_file_name())

I want to retrieve only the filename together with its parent folder from this column. Please help.

Example:

Inputfilename = "adl://dotdot.com/ingest/marketing/abc.json"

The output I am looking for is:

marketing/abc.json

Note: I can do the string operation myself; the filepath column is part of a DataFrame.

pault
Hemant Chandurkar

2 Answers

10

If you want to keep the value in a DataFrame column, you can use `pyspark.sql.functions.regexp_extract`. Apply it to the column holding the path, passing a regular expression that captures the desired part:

from pyspark.sql.functions import input_file_name, regexp_extract

data = data.withColumn("sourcefile", input_file_name())

regex_str = r"[\/]([^\/]+[\/][^\/]+)$"
data = data.withColumn("sourcefile", regexp_extract("sourcefile", regex_str, 1))
  • Thanks for the answer. What if I would like to keep last 2 parent directories instead of just one? For e.g. ingest/marketing/abc.json. Can this be parameterized to take total number of directories to return? – Aravind Yarram Oct 25 '22 at 00:22
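The regex above can be sanity-checked with plain Python `re` before wiring it into Spark, and the same construction generalizes to keeping the last N path components. A minimal sketch (the `keep_last` helper is hypothetical, not part of any Spark API):

```python
import re

def keep_last(path, n):
    # Build a regex capturing the last n slash-separated components,
    # e.g. n=2 -> r"[/]([^/]+[/][^/]+)$"
    body = "[/]".join(["[^/]+"] * n)
    regex = r"[/](" + body + r")$"
    m = re.search(regex, path)
    return m.group(1) if m else None

path = "adl://dotdot.com/ingest/marketing/abc.json"
print(keep_last(path, 2))  # marketing/abc.json
print(keep_last(path, 3))  # ingest/marketing/abc.json
```

The generated pattern for a given `n` can be passed to `regexp_extract` exactly as in the answer above.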
0

I think what you are looking for is:

sc.wholeTextFiles('path/to/files').map(
    lambda x : ( '/'.join(x[0].split('/')[-2:]), x[1])
)

This creates an RDD with two columns: the first is the path to the file, the second is the content of the file. That is the only way to link a path and its content in Spark; other methods exist in Hive, for example.
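The path manipulation inside the lambda is plain Python string work and can be checked without a Spark cluster; a minimal sketch of the key expression:

```python
# Mimic the key expression from the map() above:
# keep the last two path components (parent folder + filename).
path = "adl://dotdot.com/ingest/marketing/abc.json"
short = '/'.join(path.split('/')[-2:])
print(short)  # marketing/abc.json
```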

Steven