3

I am trying to use TensorFlow (2.2) Data Validation (TFDV version 0.22.2) to visualize data on a Databricks GPU cluster.

From a Databricks notebook, I am running the code at: https://nbviewer.jupyter.org/github/tensorflow/tfx/blob/master/docs/tutorials/data_validation/tfdv_basic.ipynb

But when I run

  tfdv.visualize_statistics(train_stats)

I get:

 <IPython.core.display.HTML object>

No HTML page is shown.

I have tried updating matplotlib, but that did not help.

I have also tried the suggestions at https://python-forum.io/Thread-How-to-display-IPython-core-display-HTML-object and in "How to embed HTML into IPython output?", but still no HTML is shown.

Could anybody help me with this?

Thanks.

UPDATE

I have tried:

html = tfdv.visualize_statistics(train_stats).data

and got:

<IPython.core.display.HTML object>
AttributeError: 'NoneType' object has no attribute 'data'
---------------------------------------------------------------------------
 AttributeError                            Traceback (most recent call last)
 <command-2488671> in <module>

----> 1 html = tfdv.visualize_statistics(train_stats).data

 AttributeError: 'NoneType' object has no attribute 'data'
user3448011

3 Answers

1

This can be fixed by importing the functions that generate the HTML and calling those instead of the visualize functions. Then display the returned HTML with the Databricks displayHTML function.

from tensorflow_data_validation.utils.display_util import get_statistics_html
displayHTML(get_statistics_html(train_stats))

The issue is that the TFDV display utility module imports the IPython display functionality, so inside the visualize functions the IPython display function is used in place of the Databricks display function:

try:
  # pylint: disable=g-import-not-at-top
  from IPython.display import display
  from IPython.display import HTML
except ImportError as e:

The display_anomalies function has a similar issue, which can be solved by importing the get_anomalies_dataframe function directly and displaying the resulting pandas DataFrame.
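For the anomalies case, a minimal sketch might look like the following. It assumes get_anomalies_dataframe is importable from display_util in your TFDV version and that it returns a pandas DataFrame. The anomalies_html helper and its to_dataframe parameter are hypothetical names used only for illustration.

```python
def anomalies_html(anomalies, to_dataframe=None):
    """Return the anomalies table as an HTML string instead of displaying it.

    `to_dataframe` is a hypothetical hook, so the helper can be exercised
    without TFDV installed.
    """
    if to_dataframe is None:
        # Assumption: same module that provides get_statistics_html.
        from tensorflow_data_validation.utils.display_util import (
            get_anomalies_dataframe,
        )
        to_dataframe = get_anomalies_dataframe
    # pandas DataFrames expose to_html(), which yields plain markup.
    return to_dataframe(anomalies).to_html()


# In a Databricks notebook (displayHTML is provided by the notebook runtime):
# displayHTML(anomalies_html(anomalies))
```

Passing the DataFrame straight to the Databricks display function, as described above, should work as well; going through HTML just keeps both cases on displayHTML.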

Ge0Dude
0

This works perfectly in a Jupyter notebook, which is what is needed to render the <IPython.core.display.HTML object>.

You can get the HTML code with:

html = tfdv.visualize_statistics(train_stats).data
Pixou
    I have tried "html = tfdv.visualize_statistics(train_stats).data", but got an error: AttributeError: 'NoneType' object has no attribute 'data' – user3448011 Jul 23 '20 at 18:25
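The error in this comment is consistent with visualize_statistics displaying the HTML object itself and returning None, so there is nothing to read .data from. A toy sketch of that behaviour follows. FakeHTML, visualize and get_html are hypothetical stand-ins, not the TFDV or IPython APIs.

```python
class FakeHTML:
    """Stand-in for a display object that keeps its markup in `.data`."""

    def __init__(self, data):
        self.data = data


def visualize(markup, display=print):
    """Display-only function: hands the object to `display`, returns None."""
    display(FakeHTML(markup))


def get_html(markup):
    """Counterpart that returns the markup so the caller can render it."""
    return markup


# Use a no-op display so nothing is printed in this sketch.
result = visualize("<p>stats</p>", display=lambda obj: None)
try:
    result.data
except AttributeError as err:
    message = str(err)  # same kind of error as reported in the question
```

This is why the answers that call a function returning the HTML, and then pass it to displayHTML, avoid the problem.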
0

I was able to display the statistics with a workaround:

I copy-pasted most of the code on this page into my Databricks notebook and modified this function so that it returns the HTML instead of displaying it. Like this:

def visualize_statistics(
    lhs_statistics: statistics_pb2.DatasetFeatureStatisticsList,
    rhs_statistics: Optional[
        statistics_pb2.DatasetFeatureStatisticsList] = None,
    lhs_name: Text = 'lhs_statistics',
    rhs_name: Text = 'rhs_statistics',
    allowlist_features: Optional[List[types.FeaturePath]] = None,
    denylist_features: Optional[List[types.FeaturePath]] = None) -> Text:
  """Return the Facets HTML for the input statistics instead of displaying it.
  Args:
    lhs_statistics: A DatasetFeatureStatisticsList protocol buffer.
    rhs_statistics: An optional DatasetFeatureStatisticsList protocol buffer to
      compare with lhs_statistics.
    lhs_name: Name of the lhs_statistics dataset.
    rhs_name: Name of the rhs_statistics dataset.
    allowlist_features: Set of features to be visualized.
    denylist_features: Set of features to ignore for visualization.
  Raises:
    TypeError: If the input argument is not of the expected type.
    ValueError: If the input statistics protos does not have only one dataset.
  """
  assert (not allowlist_features or not denylist_features), (
      'Only specify one of allowlist_features and denylist_features.')
  html = get_statistics_html(lhs_statistics, rhs_statistics, lhs_name, rhs_name,
                             allowlist_features, denylist_features)
  return html

After that you can simply do:

displayHTML(visualize_statistics(train_stats))

I know, it's not ideal, but it worked.
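A shorter variant of the same idea, as a sketch only: it assumes get_statistics_html can be imported from tensorflow_data_validation.utils.display_util and accepts the first four positional arguments used in the function above. statistics_html and html_fn are hypothetical names.

```python
def statistics_html(lhs_statistics, rhs_statistics=None,
                    lhs_name='lhs_statistics', rhs_name='rhs_statistics',
                    html_fn=None):
    """Return the statistics HTML rather than displaying it.

    `html_fn` is a hypothetical hook for trying the wrapper without TFDV.
    """
    if html_fn is None:
        # Assumption: the import path used elsewhere on this page.
        from tensorflow_data_validation.utils.display_util import (
            get_statistics_html,
        )
        html_fn = get_statistics_html
    return html_fn(lhs_statistics, rhs_statistics, lhs_name, rhs_name)


# In a Databricks notebook, for example to compare two sets of statistics
# (eval_stats is a hypothetical second statistics object):
# displayHTML(statistics_html(train_stats, eval_stats, 'train', 'eval'))
```

This avoids copying the library source into the notebook, at the cost of depending on that import path staying available.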

rv123