
I was trying to iterate through a folder in Databricks using PySpark and collect the names and paths of the files in it. Then a thought came to me: what if we could turn each file name into a variable and assign the respective file's path to it? We could use dbutils to create widgets and pass the file name as a parameter, to make things easier. Working through this, I got as far as obtaining the file paths and file names, but I couldn't figure out how to create the variables and assign the respective paths to the respective file-name variables. Here's the code:

import pandas as pd
import os

list1 = []  # file paths
list2 = []  # file names
directory = '/dbfs/FileStore/tables'  # local FUSE mount of DBFS, for os.listdir
dir = '/FileStore/tables'             # the same folder as a DBFS path, for Spark
for filename in os.listdir(directory):
  if filename.endswith(".csv") or filename.endswith(".txt"):
    file_path = os.path.join(dir, filename)
    print(file_path)
    print(filename)
    list1.append(file_path)
    list2.append(filename)
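
For context, the widget part I have in mind would be something like this (a rough sketch; dbutils.widgets is available in Databricks notebooks, and this assumes at least one file was found):

# expose the collected file names as a notebook dropdown widget
dbutils.widgets.dropdown("file", list2[0], list2)

# fetch the selected name and look up its path in the parallel lists
selected = dbutils.widgets.get("file")
selected_path = list1[list2.index(selected)]
print(selected_path)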

Thanks in advance

younus
  • Possible duplicate of https://stackoverflow.com/questions/19122345/to-convert-string-to-variable-name. But the real question is why not use a dictionary instead with `filename` as key and `file_path` as value? – Conner M. Jan 05 '20 at 06:52
  • Let's suppose I have 100 files in a folder and I just want to create a DataFrame for a single file. If I put the paths in a dictionary, I have to remember the keys of the dictionary; if instead I have variables named after the files with the paths assigned to them, I could just pass the variable name and create the whole DataFrame with little effort –  younus Jan 05 '20 at 07:04
  • Not sure what you mean by ..."remember the keys in the dictionary...", but assigned variables would also require remembering their names, plus more memory overhead, I think. A dictionary is the best data structure for this type of operation - it's compact, portable, and efficient. – Conner M. Jan 05 '20 at 07:10
  • Ok, thank you for the suggestion. I will try both approaches and update here about the pros and cons –  younus Jan 05 '20 at 08:30

2 Answers


If you're set on assigning paths to variables with the file name, then you can try:

...
for filename in os.listdir(directory):
  if filename.endswith(".csv") or filename.endswith(".txt"):
    file_path = os.path.join(dir, filename)
    print(file_path)
    print(filename)
    # dots in a file name (e.g. Dept_data.csv) are not valid in a Python
    # identifier, so replace them before building the assignment
    varname = filename.replace('.', '_')
    exec("%s = '%s'" % (varname, file_path))

Notice the additional set of quotes around the path, which avoids syntax errors on the right-hand side, and the dot replacement, which makes the left-hand side a valid identifier. However, this solution is still fraught with problems. For example, exec re-interprets backslash escape sequences inside the embedded string literal, so Windows-style paths get mangled:

>>> filename = 'file1'
>>> filepath = r'\maindir\foo'
>>> exec("%s = '%s'" % (filename, filepath))
>>> file1
'\\maindir\x0coo'
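
If you really must create variables dynamically, assigning into the globals() dictionary sidesteps exec's string parsing entirely (a sketch; it still has all the maintainability problems of dynamic names):

# assign into the module namespace directly instead of parsing a string
varname = filename.replace('.', '_')  # e.g. Dept_data.csv -> Dept_data_csv
globals()[varname] = file_path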

But a dictionary seems much better suited to this situation:

...
filenames_and_paths = {}
for filename in os.listdir(directory):
  if filename.endswith(".csv") or filename.endswith(".txt"):
    file_path = os.path.join(dir, filename)
    print(file_path)
    print(filename)
    filenames_and_paths[filename] = file_path

Not sure why you've created the two lists for the names and paths, but if they are needed you can also build the dictionary from them with zip (note that list1 holds the paths and list2 the names, so list2 supplies the keys):

filenames_and_paths = {name: path for name, path in zip(list2, list1)}
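
With the dictionary in hand, creating a DataFrame for a single file - the use case from your comment - is a single lookup (a sketch, using the spark session predefined in Databricks notebooks and your Dept_data.csv as the example):

df = spark.read.csv(filenames_and_paths["Dept_data.csv"], header=True)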
Conner M.
  • exec("%s = %s" % (filename, file_path)) raises SyntaxError: invalid syntax. The traceback ends with: File "<string>", line 1: Dept_data.csv = /FileStore/tables/Dept_data.csv ^ SyntaxError: invalid syntax –  younus Jan 05 '20 at 12:28
  • @younus, try adding an additional set of quotes. I've edited the answer to include this, but I've also played around with this and run into issues with backslashes being re-interpreted as escape sequences. There's no reason to implement what you're trying to do this way - a dictionary is simply better. – Conner M. Jan 06 '20 at 01:50

With PySpark I'd rather suggest using the Hadoop FS API to list files, since os.listdir only works on the driver's local filesystem and won't work with external buckets/storage.

Here is an example that you can adapt:

# access the Hadoop FS API via the JVM
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()

# glob the directory (use the DBFS path, not the /dbfs local mount prefix)
directory = Path("/FileStore/tables/*.csv")
gs = directory.getFileSystem(conf).globStatus(directory)

# create tuples (filename, filepath), you can also filter specific files here...
paths = []
if gs:
    paths = [(f.getPath().getName(), f.getPath().toString()) for f in gs]

for filename, file_path in paths:
    # your process
    print(filename, file_path)
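
On Databricks you can also get the same listing with dbutils.fs.ls, without dropping down to the JVM (a short sketch; it works with mounted external storage too):

# each entry returned by dbutils.fs.ls has .name and .path attributes
files = [(f.name, f.path) for f in dbutils.fs.ls("/FileStore/tables")
         if f.name.endswith(".csv") or f.name.endswith(".txt")]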
blackbishop