You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved

Question

I am using Databricks as a service on Azure. This is my cluster info:

I ran the command below and everything was OK.

 %sql
 Select 
    * 
 from db_xxxxx.t_fxxxxxxxxx
 limit 10

Then I updated some rows in the table above. When I run the same command again, I get this error:

    Error in SQL statement: SparkException: Job aborted due to stage failure: Task 3 in stage 2823.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2823.0 (TID 158824, 10.11.49.6, executor 14): com.databricks.sql.io.FileReadException: Error while reading file abfss:REDACTED_LOCAL_PART@storxfadev0501.dfs.core.windows.net/xsi-ed-faits/t_fait_xxxxxxxxxxx/_delta_log/00000000000000000022.json. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:286)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:251)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:205)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:354)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:205)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
        at org.apache.spark.scheduler.Task.run(Task.scala:112)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.io.FileNotFoundException: HEAD https://storxfadev0501.dfs.core.windows.net/devdledxsi01/xsi-ed-faits/t_fait_photo_impact/_delta_log/00000000000000000022.json?timeout=90
    StatusCode=404
    StatusDescription=The specified path does not exist.
    ErrorCode=
    ErrorMessage=
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:912)
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:169)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
        at com.databricks.spark.metrics.FileSystemWithMetrics.open(FileSystemWithMetrics.scala:282)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:85)
        at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)
        at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.readFile(JsonDataSource.scala:134)
        at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$buildReader$2.apply(JsonFileFormat.scala:138)
        at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$buildReader$2.apply(JsonFileFormat.scala:136)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:134)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:235)
        ... 26 more
    Caused by: HEAD https://storxfadev0501.dfs.core.windows.net/devdledxsi01/xsi-ed-faits/t_fait_photo_impact/_delta_log/00000000000000000022.json?timeout=90
    StatusCode=404
    StatusDescription=The specified path does not exist.
    ErrorCode=
    ErrorMessage=
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:134)
        at shaded.databricks.v20180920_b

score 4 · Answer 1 · answered Jul 13 '22 at 18:27

In summary, you can either refresh the table before running the query or restart the cluster.

spark.sql("refresh TABLE schema.table")

It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
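
As a rough sketch of the two options the error message names, assuming a Databricks notebook where `spark` is the predefined SparkSession: `spark.catalog.refreshTable` is the catalog API counterpart of `REFRESH TABLE`, and calling `spark.table` again recreates the DataFrame. The helper name `reload_table` and the table name in the usage line are hypothetical placeholders.

```python
def reload_table(spark, table_name):
    """Invalidate Spark's cached entries for a table, then rebuild the DataFrame.

    Assumption: `spark` is an active SparkSession (predefined in Databricks
    notebooks). `reload_table` is a hypothetical helper name.
    """
    # Invalidate and refresh the cached data and metadata for this table.
    spark.catalog.refreshTable(table_name)
    # Recreate the DataFrame so later actions do not reuse a stale plan.
    return spark.table(table_name)
```

In a notebook this would be called as `df = reload_table(spark, "schema.table")`. This does not clear a stale Delta cache; as the message says, that case calls for a cluster restart.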

score 3 · Accepted Answer · answered Sep 07 '20 at 10:13

This is expected behaviour when you update some rows in a table and then immediately query it.

From the error message: It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

To resolve this issue, refresh all cached entries that are associated with the table.

REFRESH TABLE [db_name.]table_name

Refresh all cached entries associated with the table. If the table was previously cached, then it would be cached lazily the next time it is scanned.
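
One way to apply this from a notebook is to retry the query once after a refresh. The sketch below assumes `spark` is an active SparkSession and that the exception text contains the `REFRESH TABLE` hint quoted above; `query_with_refresh` is a hypothetical helper, not part of Spark or Databricks.

```python
def query_with_refresh(spark, table_name, limit=10):
    """Run a SELECT; if Spark reports stale files, refresh the table and retry once.

    Assumptions: `spark` is an active SparkSession, and the failure message
    includes the 'REFRESH TABLE' hint shown in the error above.
    """
    query = f"SELECT * FROM {table_name} LIMIT {limit}"
    try:
        # Spark is lazy, so the file read (and the failure) happens in collect().
        return spark.sql(query).collect()
    except Exception as err:
        # Re-raise anything that is not the stale-file error.
        if "REFRESH TABLE" not in str(err):
            raise
        # Invalidate cached entries for the table, then run the query again.
        spark.sql(f"REFRESH TABLE {table_name}")
        return spark.sql(query).collect()
```

If the second attempt fails as well, the refresh was not enough; the comments below report that restarting the cluster resolved the problem in that case.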

answered Sep 07 '20 at 10:13

CHEEKATLAPRADEEP


This answer did not work for me. I refreshed my table but I still had the same problem. I restarted my cluster and after that everything was OK. I will give you a thumbs up. – Ardalan Shahgholi Oct 08 '20 at 13:16
This doesn't solve the issue – naveen ashok May 07 '22 at 06:31
