
Spark v2.4

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('local[15]') \
    .appName('Notebook') \
    .config('spark.sql.debug.maxToStringFields', 2000) \
    .config('spark.sql.maxPlanStringLength', 2000) \
    .config('spark.debug.maxToStringFields', 2000) \
    .getOrCreate()

df = spark.createDataFrame(spark.range(1000).rdd.map(lambda x: list(range(100))))
df.repartition(1).write.mode('overwrite').parquet('test.parquet')

df = spark.read.parquet('test.parquet')
df.select('*').explain()

== Physical Plan ==

 ReadSchema: struct<_1:bigint,_2:bigint,_3:bigint,_4:bigint,_5:bigint,_6:bigint,_7:bigint,_8:bigint,_9:bigint,...

Note: spark.debug.maxToStringFields helped a bit by expanding FileScan parquet [_1#302L,_2#303L,... 76 more fields], but not the schema part.

Note 2: I am not only interested in the ReadSchema, but also PartitionFilters, PushedFilters, ..., which are all truncated.
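As background on what `spark.debug.maxToStringFields` does: Spark elides long field lists by printing only the first entries followed by a `... N more fields` marker. Here is a pure-Python sketch of that behavior (the helper name and exact cut-off rule are illustrative approximations of Spark's internal truncation, not its API; the default of 25 mirrors the config's default):

```python
def truncated_string(fields, max_fields=25):
    # Illustrative approximation of Spark's field-list truncation:
    # if the list exceeds max_fields, keep the first max_fields - 1
    # entries and summarize the rest as "... N more fields".
    if len(fields) <= max_fields:
        return ", ".join(fields)
    shown = fields[:max_fields - 1]
    return ", ".join(shown) + f", ... {len(fields) - len(shown)} more fields"

cols = [f"_{i}L" for i in range(1, 101)]
print(truncated_string(cols))  # 24 columns, then "... 76 more fields"
```

Raising the limit (as the `spark.sql.debug.maxToStringFields` config above does) makes the full list print, which is why the `FileScan parquet [...]` column list expanded.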

Update

Spark 3.0 introduced explain('formatted'), which lays out the information differently and applies no truncation.

colinfang
  • Possible duplicate of [Spark: "Truncated the string representation of a plan since it was too large." Warning when using manually created aggregation expression](https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w) – user10938362 Mar 20 '19 at 14:07
  • where are you running this code ? – eliasah Mar 20 '19 at 15:22
  • @user10938362 I updated the question as `spark.debug.maxToStringFields` doesn't help in this case. – colinfang Mar 20 '19 at 15:22
  • @eliasah Both submit and python repl – colinfang Mar 20 '19 at 15:28
  • and why don't you use df.printSchema() or df.schema.json() ? I'm not sure I understand why is this an issue... – eliasah Mar 20 '19 at 15:34
  • Indeed, it would be nice to see all the applied `PartitionFilters` and `PushedFilters`, seeing the schema is not that important. – boyangeor Apr 01 '19 at 07:51

3 Answers

Spark 3.0 introduced explain('formatted'), which lays out the information differently and applies no truncation.

colinfang

I am afraid there is no easy way:

https://github.com/apache/spark/blob/v2.4.2/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L57

It is hard-coded to at most 100 characters:

  override def simpleString: String = {
    val metadataEntries = metadata.toSeq.sorted.map {
      case (key, value) =>
        key + ": " + StringUtils.abbreviate(redact(value), 100)
    }
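`StringUtils.abbreviate` here is Apache Commons Lang's helper: a string within the limit passes through unchanged, while a longer one is cut so the result is exactly the limit's length, ending in `...`. A minimal Python sketch of the same rule:

```python
def abbreviate(s, max_width=100):
    # Mimics org.apache.commons.lang3.StringUtils.abbreviate:
    # strings within the limit pass through; longer ones are cut
    # to max_width characters, the last three being "...".
    if len(s) <= max_width:
        return s
    return s[: max_width - 3] + "..."

# A 100-column ReadSchema string blows well past 100 characters,
# so it is always abbreviated regardless of any Spark config.
schema = "struct<" + ",".join(f"_{i}:bigint" for i in range(1, 101)) + ">"
print(len(abbreviate(schema)))  # 100
```

This is why no configuration setting helps for these metadata entries: the limit is a literal in the source, not a config lookup.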

In the end I have been using:

def full_file_meta(f: FileSourceScanExec) = {
    val metadataEntries = f.metadata.toSeq.sorted.flatMap {
      // Keep only the entries that the default plan output truncates.
      case (key, value) if Set(
          "Location", "PartitionCount",
          "PartitionFilters", "PushedFilters"
      ).contains(key) =>
        Some(key + ": " + value.toString)
      case _ => None
    }

    val metadataStr = metadataEntries.mkString("[\n  ", ",\n  ", "\n]")
    s"${f.nodeNamePrefix}${f.nodeName}$metadataStr"
}

val ep = data.queryExecution.executedPlan

print(ep.flatMap {
    case f: FileSourceScanExec => full_file_meta(f)::Nil
    case other => Nil
}.mkString(",\n"))

It is a hack, but better than nothing.

colinfang
  • @colinfang see my answer - Spark has a built-in pivot away from `simpleString` available, no hack required. – bsplosion May 07 '19 at 15:03
  • Very nice solution, it's almost impossible to check whether a predicate is pushed down to the file scanner correctly without this hack. – Kien Truong Apr 08 '20 at 13:18

In PySpark, all you have to do is explain with extended mode:

df.select('*').explain(True)

This calls the non-simpleString implementation - see the DataFrame#explain docs for more info.


Running the code in the question results in the following output with df.select('*').explain(extended=True):

== Parsed Logical Plan ==
'Project [*]
+- Relation[_1#304L,_2#305L,_3#306L,_4#307L,_5#308L,_6#309L,_7#310L,_8#311L,_9#312L,_10#313L,_11#314L,_12#315L,_13#316L,_14#317L,_15#318L,_16#319L,_17#320L,_18#321L,_19#322L,_20#323L,_21#324L,_22#325L,_23#326L,_24#327L,_25#328L,_26#329L,_27#330L,_28#331L,_29#332L,_30#333L,_31#334L,_32#335L,_33#336L,_34#337L,_35#338L,_36#339L,_37#340L,_38#341L,_39#342L,_40#343L,_41#344L,_42#345L,_43#346L,_44#347L,_45#348L,_46#349L,_47#350L,_48#351L,_49#352L,_50#353L,_51#354L,_52#355L,_53#356L,_54#357L,_55#358L,_56#359L,_57#360L,_58#361L,_59#362L,_60#363L,_61#364L,_62#365L,_63#366L,_64#367L,_65#368L,_66#369L,_67#370L,_68#371L,_69#372L,_70#373L,_71#374L,_72#375L,_73#376L,_74#377L,_75#378L,_76#379L,_77#380L,_78#381L,_79#382L,_80#383L,_81#384L,_82#385L,_83#386L,_84#387L,_85#388L,_86#389L,_87#390L,_88#391L,_89#392L,_90#393L,_91#394L,_92#395L,_93#396L,_94#397L,_95#398L,_96#399L,_97#400L,_98#401L,_99#402L,_100#403L] parquet
== Analyzed Logical Plan ==
_1: bigint, _2: bigint, _3: bigint, _4: bigint, _5: bigint, _6: bigint, _7: bigint, _8: bigint, _9: bigint, _10: bigint, _11: bigint, _12: bigint, _13: bigint, _14: bigint, _15: bigint, _16: bigint, _17: bigint, _18: bigint, _19: bigint, _20: bigint, _21: bigint, _22: bigint, _23: bigint, _24: bigint, _25: bigint, _26: bigint, _27: bigint, _28: bigint, _29: bigint, _30: bigint, _31: bigint, _32: bigint, _33: bigint, _34: bigint, _35: bigint, _36: bigint, _37: bigint, _38: bigint, _39: bigint, _40: bigint, _41: bigint, _42: bigint, _43: bigint, _44: bigint, _45: bigint, _46: bigint, _47: bigint, _48: bigint, _49: bigint, _50: bigint, _51: bigint, _52: bigint, _53: bigint, _54: bigint, _55: bigint, _56: bigint, _57: bigint, _58: bigint, _59: bigint, _60: bigint, _61: bigint, _62: bigint, _63: bigint, _64: bigint, _65: bigint, _66: bigint, _67: bigint, _68: bigint, _69: bigint, _70: bigint, _71: bigint, _72: bigint, _73: bigint, _74: bigint, _75: bigint, _76: bigint, _77: bigint, _78: bigint, _79: bigint, _80: bigint, _81: bigint, _82: bigint, _83: bigint, _84: bigint, _85: bigint, _86: bigint, _87: bigint, _88: bigint, _89: bigint, _90: bigint, _91: bigint, _92: bigint, _93: bigint, _94: bigint, _95: bigint, _96: bigint, _97: bigint, _98: bigint, _99: bigint, _100: bigint
Project [_1#304L, _2#305L, _3#306L, _4#307L, _5#308L, _6#309L, _7#310L, _8#311L, _9#312L, _10#313L, _11#314L, _12#315L, _13#316L, _14#317L, _15#318L, _16#319L, _17#320L, _18#321L, _19#322L, _20#323L, _21#324L, _22#325L, _23#326L, _24#327L, _25#328L, _26#329L, _27#330L, _28#331L, _29#332L, _30#333L, _31#334L, _32#335L, _33#336L, _34#337L, _35#338L, _36#339L, _37#340L, _38#341L, _39#342L, _40#343L, _41#344L, _42#345L, _43#346L, _44#347L, _45#348L, _46#349L, _47#350L, _48#351L, _49#352L, _50#353L, _51#354L, _52#355L, _53#356L, _54#357L, _55#358L, _56#359L, _57#360L, _58#361L, _59#362L, _60#363L, _61#364L, _62#365L, _63#366L, _64#367L, _65#368L, _66#369L, _67#370L, _68#371L, _69#372L, _70#373L, _71#374L, _72#375L, _73#376L, _74#377L, _75#378L, _76#379L, _77#380L, _78#381L, _79#382L, _80#383L, _81#384L, _82#385L, _83#386L, _84#387L, _85#388L, _86#389L, _87#390L, _88#391L, _89#392L, _90#393L, _91#394L, _92#395L, _93#396L, _94#397L, _95#398L, _96#399L, _97#400L, _98#401L, _99#402L, _100#403L]
+- Relation[_1#304L,_2#305L,_3#306L,_4#307L,_5#308L,_6#309L,_7#310L,_8#311L,_9#312L,_10#313L,_11#314L,_12#315L,_13#316L,_14#317L,_15#318L,_16#319L,_17#320L,_18#321L,_19#322L,_20#323L,_21#324L,_22#325L,_23#326L,_24#327L,_25#328L,_26#329L,_27#330L,_28#331L,_29#332L,_30#333L,_31#334L,_32#335L,_33#336L,_34#337L,_35#338L,_36#339L,_37#340L,_38#341L,_39#342L,_40#343L,_41#344L,_42#345L,_43#346L,_44#347L,_45#348L,_46#349L,_47#350L,_48#351L,_49#352L,_50#353L,_51#354L,_52#355L,_53#356L,_54#357L,_55#358L,_56#359L,_57#360L,_58#361L,_59#362L,_60#363L,_61#364L,_62#365L,_63#366L,_64#367L,_65#368L,_66#369L,_67#370L,_68#371L,_69#372L,_70#373L,_71#374L,_72#375L,_73#376L,_74#377L,_75#378L,_76#379L,_77#380L,_78#381L,_79#382L,_80#383L,_81#384L,_82#385L,_83#386L,_84#387L,_85#388L,_86#389L,_87#390L,_88#391L,_89#392L,_90#393L,_91#394L,_92#395L,_93#396L,_94#397L,_95#398L,_96#399L,_97#400L,_98#401L,_99#402L,_100#403L] parquet
== Optimized Logical Plan ==
Relation[_1#304L,_2#305L,_3#306L,_4#307L,_5#308L,_6#309L,_7#310L,_8#311L,_9#312L,_10#313L,_11#314L,_12#315L,_13#316L,_14#317L,_15#318L,_16#319L,_17#320L,_18#321L,_19#322L,_20#323L,_21#324L,_22#325L,_23#326L,_24#327L,_25#328L,_26#329L,_27#330L,_28#331L,_29#332L,_30#333L,_31#334L,_32#335L,_33#336L,_34#337L,_35#338L,_36#339L,_37#340L,_38#341L,_39#342L,_40#343L,_41#344L,_42#345L,_43#346L,_44#347L,_45#348L,_46#349L,_47#350L,_48#351L,_49#352L,_50#353L,_51#354L,_52#355L,_53#356L,_54#357L,_55#358L,_56#359L,_57#360L,_58#361L,_59#362L,_60#363L,_61#364L,_62#365L,_63#366L,_64#367L,_65#368L,_66#369L,_67#370L,_68#371L,_69#372L,_70#373L,_71#374L,_72#375L,_73#376L,_74#377L,_75#378L,_76#379L,_77#380L,_78#381L,_79#382L,_80#383L,_81#384L,_82#385L,_83#386L,_84#387L,_85#388L,_86#389L,_87#390L,_88#391L,_89#392L,_90#393L,_91#394L,_92#395L,_93#396L,_94#397L,_95#398L,_96#399L,_97#400L,_98#401L,_99#402L,_100#403L] parquet
== Physical Plan ==
*(1) FileScan parquet [_1#304L,_2#305L,_3#306L,_4#307L,_5#308L,_6#309L,_7#310L,_8#311L,_9#312L,_10#313L,_11#314L,_12#315L,_13#316L,_14#317L,_15#318L,_16#319L,_17#320L,_18#321L,_19#322L,_20#323L,_21#324L,_22#325L,_23#326L,_24#327L,_25#328L,_26#329L,_27#330L,_28#331L,_29#332L,_30#333L,_31#334L,_32#335L,_33#336L,_34#337L,_35#338L,_36#339L,_37#340L,_38#341L,_39#342L,_40#343L,_41#344L,_42#345L,_43#346L,_44#347L,_45#348L,_46#349L,_47#350L,_48#351L,_49#352L,_50#353L,_51#354L,_52#355L,_53#356L,_54#357L,_55#358L,_56#359L,_57#360L,_58#361L,_59#362L,_60#363L,_61#364L,_62#365L,_63#366L,_64#367L,_65#368L,_66#369L,_67#370L,_68#371L,_69#372L,_70#373L,_71#374L,_72#375L,_73#376L,_74#377L,_75#378L,_76#379L,_77#380L,_78#381L,_79#382L,_80#383L,_81#384L,_82#385L,_83#386L,_84#387L,_85#388L,_86#389L,_87#390L,_88#391L,_89#392L,_90#393L,_91#394L,_92#395L,_93#396L,_94#397L,_95#398L,_96#399L,_97#400L,_98#401L,_99#402L,_100#403L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/path/test.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:bigint,_2:bigint,_3:bigint,_4:bigint,_5:bigint,_6:bigint,_7:bigint,_8:bigint,_9:bigint,...

For some reason the ReadSchema property is always truncated, while all the other components respect the `spark.debug.maxToStringFields` setting appropriately.

bsplosion
  • I am aware of the `simpleString` & non-`simpleString`, but neither helps here. `explain(True)` prints Optimization plan and logical plan in addition to the physical plan, but the character limit still applies. – colinfang May 07 '19 at 15:33
  • Very odd - we have `spark.debug.maxToStringFields` set to 999999 and have successfully output extremely large plans without any truncation. Are you sure you've applied the SparkConf correctly in the context where you run `explain(True)`? – bsplosion May 07 '19 at 16:32
  • There are 2 types of truncations, `spark.debug.maxToStringFields` helps with one type. The other type is hard coded so not possible to change. If you run the example in my question and paste the print out in the answer I might be able to tell if it works or not. – colinfang May 07 '19 at 16:42
  • Ah, you're specifically talking about the `ReadSchema` portion of the query plan, not the entire thing. That portion will always be truncated after a fixed length which doesn't appear to be tied to any conf I've seen. Note that all other portions of the query plan do respond to `maxToStringFields` as expected - I'll update the answer with a run of your code. – bsplosion May 07 '19 at 19:18
  • Exactly. The hard coded truncation also applies to "PartitionFilters" & "PushedFilters" which are not shown in the example. – colinfang May 07 '19 at 20:34