
When I use df.show() to view a PySpark DataFrame in a Jupyter notebook,

it shows me this:

+---+-------+-------+-------+------+-----------+-----+-------------+-----+---------+----------+-----+-----------+-----------+--------+---------+-------+------------+---------+------------+---------+---------------+------------+---------------+---------+------------+
| Id|groupId|matchId|assists|boosts|damageDealt|DBNOs|headshotKills|heals|killPlace|killPoints|kills|killStreaks|longestKill|maxPlace|numGroups|revives|rideDistance|roadKills|swimDistance|teamKills|vehicleDestroys|walkDistance|weaponsAcquired|winPoints|winPlacePerc|
+---+-------+-------+-------+------+-----------+-----+-------------+-----+---------+----------+-----+-----------+-----------+--------+---------+-------+------------+---------+------------+---------+---------------+------------+---------------+---------+------------+
|  0|     24|      0|      0|     5|   247.3000|    2|            0|    4|       17|      1050|    2|          1|    65.3200|      29|       28|      1|    591.3000|        0|      0.0000|        0|              0|    782.4000|              4|     1458|      0.8571|
|  1| 440875|      1|      1|     0|    37.6500|    1|            1|    0|       45|      1072|    1|          1|    13.5500|      26|       23|      0|      0.0000|        0|      0.0000|        0|              0|    119.6000|              3|     1511|      0.0400|
|  2| 878242|      2|      0|     1|    93.7300|    1|            0|    2|       54|      1404|    0|          0|     0.0000|      28|       28|      1|      0.0000|        0|      0.0000|        0|              0|   3248.0000|              5|     1583|      0.7407|
|  3|1319841|      3|      0|     0|    95.8800|    0|            0|    0|       86|      1069|    0|          0|     0.0000|      97|       94|      0|      0.0000|        0|      0.0000|        0|              0|     21.4900|              1|     1489|      0.1146|
|  4|1757883|      4|      0|     1|     0.0000|    0|            0|    1|       58|      1034|    0|          0|     0.0000|      47|  

How can I get a formatted output, just like a pandas DataFrame, so I can view the data more efficiently?

Talha Tayyab
sdy b
  • Possible duplicate of [Show DataFrame as table in iPython Notebook](https://stackoverflow.com/questions/26873127/show-dataframe-as-table-in-ipython-notebook) – Mohamed El-Touny Dec 11 '18 at 09:28
  • 1
    you can convert `spark` dataframe into `pandas` dataframe, but it will be a memory overhead if resulting dataframe is too large. you can check doc for `show` here http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.show – frank Dec 11 '18 at 09:34

3 Answers


You can convert a PySpark DataFrame directly to a pandas DataFrame:
df.limit(10).toPandas()

This yields the result as a pandas DataFrame; you just need the pandas package installed. The limit(10) keeps the driver from collecting the whole dataset.
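If the converted frame still looks truncated in the notebook (the question's frame has 26 columns), pandas' own display options control the rendering. A minimal sketch; the commented-out conversion assumes an active SparkSession named in the question:

```python
import pandas as pd

# Widen pandas' notebook rendering so wide frames (like the 26-column
# frame in the question) are not truncated after toPandas().
pd.set_option("display.max_columns", None)  # None = show every column
pd.set_option("display.width", None)        # None = auto-detect width

# Assumed conversion step (requires pyspark and a running SparkSession):
# pdf = df.limit(10).toPandas()
```

These options only affect how pandas renders the converted frame; they have no effect on Spark's own show() output.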

sat

You can use IPython's `display` function:

from IPython.display import display
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

display(df)  # renders the DataFrame as an HTML table in the notebook
Mohamed El-Touny
  • This does not answer the question. He wants to show a PySpark `DataFrame` in a formatted way (similar to how a pandas DataFrame can be shown). Note that pandas and `pyspark` DataFrames are not the same! – pvy4917 Dec 11 '18 at 17:44
  • The above-mentioned code is also correct for PySpark when used in a Jupyter notebook – Mohamed El-Touny Dec 12 '18 at 11:26
  • Thanks for your answer. But when I use the PySpark DataFrame's show(), display doesn't work. – sdy b Dec 13 '18 at 03:36
  • This answer works fine. Don't call `df.show().display`, but (as shown in the answer) instead call `display(df)`. It works for Pandas or Spark DataFrame. – Kirk Broadhurst Feb 04 '21 at 11:38

As @sat mentioned in their answer, you can use:

df.toPandas()

Or, better, limit the rows first:

df.limit(10).toPandas()
# where 10 is the number of rows

to convert your dataframe into pandas dataframe.

However, if you want to view the data in PySpark itself, you can use:

df.show(10, truncate=False)

If you want to see each row of your dataframe individually then use:

df.show(10, vertical=True)

Also, you can find the total number of records with:

df.count()
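For comparison, a rough pandas analogue of the vertical view is transposing a few rows. A sketch with toy data standing in for the converted Spark frame (the column names are taken from the question's schema):

```python
import pandas as pd

# Toy stand-in for df.limit(2).toPandas() on the question's data
pdf = pd.DataFrame({"kills": [2, 1], "winPlacePerc": [0.8571, 0.0400]})

# Rough pandas analogue of df.show(2, vertical=True):
# transposing makes each record a column of field/value pairs,
# which is easier to scan when there are many columns
print(pdf.head(2).T)
```

This is only a display trick on the pandas side; on large data, keep the limit() before toPandas() as noted above.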
Talha Tayyab