Is there any way to import a dataset in a code repository without using the transform or transform_df decorators?

Basically, I want to extract the data from the dataset and return all the values as a list. If I use the transform or transform_df decorator, I won't be able to access that input file from the function that returns the list.

  • Have you tried using `df.collect()`? This would allow you to convert dataframe to python list of rows – proggeo Aug 05 '21 at 10:32
  • once the dataset is imported without transform or transform_df then i can use df.collect() to get values in that dataframe but how to import dataset without transform or transform_df ?? – Monica Gaddipati Aug 05 '21 at 11:10
  • what are you trying to achieve by using it outside of transform/transform_df? – proggeo Aug 05 '21 at 11:54
  • https://stackoverflow.com/questions/64318411/how-to-access-the-data-frame-without-my-compute-function?rq=1 - this is likely related – proggeo Aug 09 '21 at 14:30

1 Answer

Are you trying to access the raw files in the dataset? That is possible using the filesystem API. Search your stack's documentation for "Raw File Access", where you can find example Python code. You still use the @transform decorator, but instead of calling .dataframe() on the input you call .filesystem(). Here's some example code:

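import csv

# hair_eye_color is the input object passed to the @transform compute function
with hair_eye_color.filesystem().open('students.csv') as f:
    reader = csv.reader(f, delimiter=',')
    print(next(reader))  # header row: ['id', 'hair', 'eye', 'sex']
    print(next(reader))  # first data row: ['1', 'brown', 'brown', 'M']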

You can then create a Spark dataframe from the file data and write it to the output.
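If the goal is simply to get every value back as a Python list, as the question asks, the same pattern can be wrapped in a small helper. This is only a rough sketch: `read_csv_rows` and `FakeFileSystem` are hypothetical names that are not part of any library, and the fake filesystem exists only so the helper can be tried outside a transform. It assumes the filesystem object's `open(path)` returns a text file usable in a `with` block, as in the snippet above.

```python
import csv
import io


def read_csv_rows(fs, path):
    """Return every row of a CSV file as a list of lists of strings.

    `fs` is any object whose open(path) returns a text file object,
    e.g. what hair_eye_color.filesystem() is assumed to provide.
    """
    with fs.open(path) as f:
        return list(csv.reader(f, delimiter=','))


class FakeFileSystem:
    """Hypothetical in-memory stand-in, for local experimentation only."""

    def __init__(self, files):
        self._files = files  # maps path -> file contents

    def open(self, path):
        return io.StringIO(self._files[path])


# Sample contents mirror the rows shown in the example above.
fs = FakeFileSystem({'students.csv': 'id,hair,eye,sex\n1,brown,brown,M\n'})
rows = read_csv_rows(fs, 'students.csv')
header, data = rows[0], rows[1:]
```

Inside the compute function you would pass `hair_eye_color.filesystem()` in place of the fake object. Note that this loads the whole file into memory, so it is only sensible for small files.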

Kellen Donohue