
I need to read a number of CSV files with PySpark in Databricks.

spark.read.option("header", True).format("csv").load('path/2021', 'path/2020')

Is it possible to make this dynamic using variables, like this?

years_to_query = ['2020', '2021']

path = 'path'

I'm trying to concatenate the variables into paths, but I'm not getting the desired result.

  • Concatenation should work easily here. I don't know what you are doing wrong. – mkrieger1 Feb 07 '22 at 23:46
  • Generally speaking, you shouldn't use string concatenation to join file-system path components, use `os.path.join()`, which isn't OS-specific. – martineau Feb 08 '22 at 00:31
  • You don't need `zip()` or `itertools.repeat()` here as @azelcer suggested — use `load(*(os.path.join(path, year) for year in years_to_query))` instead, which makes use of a [generator expression](https://docs.python.org/3/reference/expressions.html#generator-expressions) to create the arguments. – martineau Feb 08 '22 at 00:40
  • You could also use the [`pathlib`](https://docs.python.org/3/library/pathlib.html#module-pathlib) module and do it like this: `load(*((Path(path)/year) for year in years_to_query))`. – martineau Feb 08 '22 at 00:48
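A minimal sketch of the approach suggested in the comments, assuming the CSV directories live under `path/<year>`. The Spark call itself is shown commented out, since it needs a running Spark session; the path construction is plain Python:

```python
import os

years_to_query = ['2020', '2021']
path = 'path'

# Build one path per year; os.path.join handles separators portably.
paths = [os.path.join(path, year) for year in years_to_query]
print(paths)

# Unpack the list into load(), which accepts multiple path arguments:
# df = spark.read.option("header", True).format("csv").load(*paths)
```

The `pathlib` variant from the last comment works the same way, except that `load()` may need the `Path` objects converted to strings first (e.g. `str(Path(path) / year)`).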

0 Answers