
I am trying to store a simple string as a text file in Data Lake Gen2 with Python code written in a Synapse notebook, but it doesn't seem to be straightforward.

I tried to convert the text into an RDD and then store it:

from pyspark import SparkConf
from pyspark import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
str = "test string"

text_path = adls_path + 'test.xml'

rdd_text = sc.parallelize(list(str)).collect()
# type(rdd_text)

rdd_text.saveAsTextFile(text_path)

but it gives the following error:

Traceback (most recent call last):
  ...
AttributeError: 'list' object has no attribute 'saveAsTextFile'

1 Answer


In `rdd_text = sc.parallelize(list(str)).collect()`, the result is stored in `rdd_text` as a plain Python list, because `collect()` returns a list.
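You can see the mismatch without Spark at all. A minimal sketch (pure Python, building the same character list that `collect()` would return):

```python
# collect() on sc.parallelize(list("test string")) would return a plain
# Python list of characters; build the same list directly here.
collected = list("test string")

print(type(collected))                       # <class 'list'>
print(hasattr(collected, "saveAsTextFile"))  # False -> hence the AttributeError
```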

An RDD is a distributed data structure and the basic abstraction in Spark, and it is immutable.

For example, `remove()` and `append()` are methods of Python lists for adding or removing elements; in the same way, `saveAsTextFile` is a method of RDDs for writing files.

Similarly, a `tuple` has no `append` attribute because tuples are immutable, and RDDs are immutable in the same way.
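The tuple analogy can be checked directly; calling `append` on a tuple raises the same kind of `AttributeError`:

```python
t = (1, 2, 3)
try:
    t.append(4)  # tuples are immutable and have no append method
except AttributeError as e:
    print(e)  # 'tuple' object has no attribute 'append'
```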

Hence, instead of `rdd_text = sc.parallelize(list(str)).collect()`, use `rdd_text = sc.parallelize(list(str))` so the result stays an RDD rather than being collected into a list.

from pyspark import SparkConf
from pyspark import SparkContext

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

string = "test string"
adls_path = "abfss://data@xxxxxxxx.dfs.core.windows.net/"

text_path = adls_path + 'test.txt'
# Keep the RDD (no collect()) so that saveAsTextFile is available.
# Note: list(string) splits the string into individual characters, so each
# character becomes its own record; use [string] to write it as one line.
rdd_text = sc.parallelize(list(string))

rdd_text.saveAsTextFile(text_path)
  • Thanks for helping out! I tested the code from your answer, and it stores the text as several blob files, and the file name I gave it became the folder's name. Is there any way to resolve this? – Bruce Jul 07 '21 at 14:55
  • Here is a document which can be followed to get the output in one partition: https://mungingdata.com/apache-spark/output-one-file-csv-parquet/, and for renaming the folder: https://stackoverflow.com/questions/54101135/how-do-i-rename-the-file-that-was-saved-on-a-datalake-in-azure – IpsitaDash-MT Jul 12 '21 at 08:51