6

I am using DataBricks as a service on Azure. This is my cluster info :

enter image description here

I ran below command and everythings was o.

 %sql
 Select 
    * 
 from db_xxxxx.t_fxxxxxxxxx
 limit 10

Then I have updated some rows in above table. When I run above command again i have this error :

    Error in SQL statement: SparkException: Job aborted due to stage failure: Task 3 in stage 2823.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2823.0 (TID 158824, 10.11.49.6, executor 14): com.databricks.sql.io.FileReadException: Error while reading file abfss:REDACTED_LOCAL_PART@storxfadev0501.dfs.core.windows.net/xsi-ed-faits/t_fait_xxxxxxxxxxx/_delta_log/00000000000000000022.json. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:286)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:251)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:205)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:354)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:205)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
        at org.apache.spark.scheduler.Task.run(Task.scala:112)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.io.FileNotFoundException: HEAD https://storxfadev0501.dfs.core.windows.net/devdledxsi01/xsi-ed-faits/t_fait_photo_impact/_delta_log/00000000000000000022.json?timeout=90
    StatusCode=404
    StatusDescription=The specified path does not exist.
    ErrorCode=
    ErrorMessage=
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:912)
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:169)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
        at com.databricks.spark.metrics.FileSystemWithMetrics.open(FileSystemWithMetrics.scala:282)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:85)
        at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)
        at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.readFile(JsonDataSource.scala:134)
        at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$buildReader$2.apply(JsonFileFormat.scala:138)
        at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$buildReader$2.apply(JsonFileFormat.scala:136)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:134)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:235)
        ... 26 more
    Caused by: HEAD https://storxfadev0501.dfs.core.windows.net/devdledxsi01/xsi-ed-faits/t_fait_photo_impact/_delta_log/00000000000000000022.json?timeout=90
    StatusCode=404
    StatusDescription=The specified path does not exist.
    ErrorCode=
    ErrorMessage=
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:134)
        at shaded.databricks.v20180920_b
Ardalan Shahgholi
  • 11,967
  • 21
  • 108
  • 144

2 Answers2

4

In summary, you can either refresh the table (previous to execution ) name or restart the cluster

spark.sql("refresh TABLE schema.table")

It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.

Jorge Tovar
  • 1,374
  • 12
  • 17
3

This is expected behaviour when you update some rows in the table and immediately querying the table.

From the error message: It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

To resolve this issue, refresh all cached entries that are associated with the table.

REFRESH TABLE [db_name.]table_name

Refresh all cached entries associated with the table. If the table was previously cached, then it would be cached lazily the next time it is scanned.

CHEEKATLAPRADEEP
  • 12,191
  • 1
  • 19
  • 42