I'm trying to do a very basic thing, but I get an error for which I can't find a solution.
I want to execute SQL queries on a dataframe. According to the documentation here, you need to create a view from the dataframe first. The problem is that, although I do exactly what the documentation shows, IntelliJ keeps giving me an error.
This is my code:
@Override
public Long execute() {
    log.info("Starting processing query");
    Instant start = Instant.now();
    Dataset<Row> dataframe = this.hdfsIO.readParquetAsDataframe(vaccineAdministrationSummaryFile);
    // Dataset<Row> dataframe = this.sparkSession.read().parquet(this.hdfsUrl + inputDir + "/" + filename);
    dataframe.createOrReplaceTempView("query");
    Dataset<Row> sqlDF = sparkSession.sql("SELECT * FROM query");
The line sparkSession.sql("SELECT * FROM query");
gives me an error and a warning:
- Warning: No data sources are configured to run this SQL and provide advanced code assistance. Disable this inspection via problem menu (alt+enter).
- Error: Unable to resolve table 'query'.
Why does this happen if my code is pretty much the same as the one in the documentation?
Running sparkSession.catalog().listTables().show(false);
after dataframe.createOrReplaceTempView("query");
I get this:
21/05/31 09:36:29 INFO CodeGenerator: Code generated in 27.591936 ms
21/05/31 09:36:29 INFO CodeGenerator: Code generated in 11.52196 ms
21/05/31 09:36:29 INFO CodeGenerator: Code generated in 15.643281 ms
+-----+--------+-----------+---------+-----------+
|name |database|description|tableType|isTemporary|
+-----+--------+-----------+---------+-----------+
|query|null |null |TEMPORARY|true |
+-----+--------+-----------+---------+-----------+
Instead, running the same command before dataframe.createOrReplaceTempView("query");
I get this:
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+
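As an extra sanity check (this assumes the Catalog API's tableExists method, which I believe my Spark version provides), I can also confirm the view programmatically right after dataframe.createOrReplaceTempView("query"):

// sketch: verify the temp view is visible to this SparkSession at runtime
boolean viewExists = sparkSession.catalog().tableExists("query");
log.info("Temp view 'query' registered: " + viewExists);

If this logs true, the view really is registered at runtime, which matches the first catalog listing above.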
The function readParquetAsDataframe()
is defined as follows; the string passed is the path of the file in HDFS:
public Dataset<Row> readParquetAsDataframe(String filename) {
    return this.sparkSession.read().parquet(this.hdfsUrl + inputDir + "/" + filename);
}
I also wondered whether the problem could lie in how I build the SparkSession, since I don't specify any configuration, but in that case I wouldn't even know how to configure it properly. The SparkSession is built as follows:
SparkSession spark = SparkSession
        .builder()
        .appName("Project 1")
        .getOrCreate();
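For reference, if missing configuration were the cause, I imagine an explicit builder would look roughly like this (the master URL and the extra config key below are just placeholder values I made up, not settings from my actual environment):

SparkSession spark = SparkSession
        .builder()
        .appName("Project 1")
        .master("local[*]")                          // placeholder: run locally using all cores
        .config("spark.sql.shuffle.partitions", "8") // example option only, not something I actually set
        .getOrCreate();

But since the temp view does show up in the catalog listing, I'm not sure missing configuration is really the problem.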