I am learning to use PySpark, and at work I have been assigned a task that I have not been able to solve so far.

There is a config.ini file that holds a series of settings for Spark as well as for the tables, paths and files that we are going to use.

The values follow a structure like this, for example:

[paths]
path_file_orig=/Sources/dimension/revenues

[sep]
delimiter_coma = ,
delimiter_pipe = |

[tables]
table_name = nyse
partition_name = partition_date

One of the many problems I have is that I am asked to add to this config.ini the parameters of a new ingest to be performed.

We are going to receive a fixed-width file, where each column has its own width.

It was suggested to me that the config.ini file should hold something like this for each column:

column_name, length, type

The file may gain more columns in the future, so the DataFrame must be built dynamically, taking the column names and widths from config.ini and nothing more.

I was reading about how to load a fixed-width file:

# substr(start, length) uses a 1-based start position
df = spark.read.text("/tmp/sample.txt")
df.select(
    df.value.substr(1,3).alias('id'),
    df.value.substr(4,8).alias('date'),
    df.value.substr(12,3).alias('string'),
    df.value.substr(15,4).cast('integer').alias('integer')
).show()

taken from this question

But how do I generate the DataFrame if the names and widths of the columns come from the configuration file?
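To make the goal concrete, here is a minimal sketch of what I imagine the dynamic version would look like. It assumes the column definitions have already been loaded into a list of (name, length, type) tuples. The names `column_specs`, `with_offsets` and `read_fixed_width` are hypothetical, made up for illustration, and the sample values simply mirror the hard-coded example above.

```python
# Hypothetical column definitions: (name, length, type), in file order.
column_specs = [
    ("id", 3, "string"),
    ("date", 8, "string"),
    ("string", 3, "string"),
    ("integer", 4, "integer"),
]

def with_offsets(specs):
    """Turn (name, length, type) into (name, start, length, type).

    Start positions are 1-based, as expected by Column.substr.
    """
    layout = []
    start = 1
    for name, length, dtype in specs:
        layout.append((name, start, length, dtype))
        start += length  # the next column begins right after this one
    return layout

def read_fixed_width(spark, path, specs):
    """Read a fixed-width text file and slice each line into typed columns."""
    df = spark.read.text(path)
    columns = [
        df.value.substr(start, length).cast(dtype).alias(name)
        for name, start, length, dtype in with_offsets(specs)
    ]
    return df.select(*columns)
```

With an existing SparkSession this would be called as `read_fixed_width(spark, "/tmp/sample.txt", column_specs).show()`, so adding a column would only mean adding one entry to the list.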

I was also reading about dynamic generation here, but that example does not take the columns from a file; it takes them from the columns the DataFrame already has: example

I am also unsure how to include the names and widths of the columns in config.ini because, as I showed at the beginning, the config file only holds plain key and value pairs.

When I read them, I write:

import configparser

conf = configparser.ConfigParser()
conf.read(config_file)

and from there the variables are set:

source_file_path = conf.get('paths','source_file_path')
path_file_dest_one_path = conf.get('paths','path_file_dest_one_path')
del_option_test_one = conf.get('sep','delimiter_coma')

But for the columns of the new ingest, since they are dynamic (defined in config.ini), how do I read them so that all the columns present are picked up?
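One layout I have considered, as a sketch only, is a dedicated section where each key is a column name and each value is `length,type`, read in one go with `ConfigParser.items`. The section name `[columns]`, the value format and the helper `load_column_specs` are my own assumptions, not something that exists in the current config.ini. The sample is parsed from a string here only to keep the example self-contained.

```python
import configparser

# Hypothetical section: one "name = length,type" entry per column, in file order.
SAMPLE_CONFIG = """
[columns]
id = 3,string
date = 8,string
string = 3,string
integer = 4,integer
"""

def load_column_specs(conf, section="columns"):
    """Return every entry of the section as a (name, length, type) tuple."""
    specs = []
    for name, raw in conf.items(section):
        length, dtype = (part.strip() for part in raw.split(","))
        specs.append((name, int(length), dtype))
    return specs

conf = configparser.ConfigParser()
conf.read_string(SAMPLE_CONFIG)  # with the real file this would be conf.read(config_file)
column_specs = load_column_specs(conf)
```

As far as I understand, configparser keeps the keys in the order they appear in the file and lowercases them by default, so the column order would follow the file but mixed-case column names would need extra handling.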
