I am new to Apache Spark. I created a schema and a DataFrame, and it shows me the result, but the output is messy and I can hardly read the lines. I would like to show the result in pandas format instead, but I don't know how. I attached a screenshot of my DataFrame output.

Here's my code:

from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.types import * 
from IPython.display import display 
import pandas as pd 
import gzip

schema = StructType([StructField("crimeid", StringType(), True), 
                     StructField("Month", StringType(), True), 
                     StructField("Reported_by", StringType(), True),
                     StructField("Falls_within", StringType(), True), 
                     StructField("Longitude", FloatType(), True), 
                     StructField("Latitue", FloatType(), True), 
                     StructField("Location", StringType(), True),
                     StructField("LSOA_code", StringType(), True),
                     StructField("LSOA_name", StringType(), True),
                     StructField("Crime_type", StringType(), True),
                     StructField("Outcome_type", StringType(), True),
                    ])

df = spark.read.csv("crimes.gz",header=False,schema=schema)
df.printSchema()

PATH = "crimes.gz"
csvfile = spark.read.format("csv")\
.option("header", "false")\
.schema(schema)\
.load(PATH)
csvfile.show()  # show() prints the rows and returns None, so there is nothing useful to assign

It shows the result like this:

[screenshot of the show() output]

but I want this data in pandas form.

Thanks

mck
  • Does this answer your question? [Convert a spark DataFrame to pandas DF](https://stackoverflow.com/questions/50958721/convert-a-spark-dataframe-to-pandas-df) – SMaZ Dec 07 '20 at 22:50
  • You can also just paste it into any editor or Excel and it won't wrap. – jayrythium Dec 08 '20 at 13:40
  • Maybe you can use csvfile.show(truncate=False); this shows the full cell values instead of cutting them off, which may be easier to read. – Sachin Tiwari Jun 29 '22 at 06:54

2 Answers

You can print each row vertically (one field per line), and optionally truncate long values:

df.show(2, vertical=True)
df.show(2, truncate=4, vertical=True)
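
As a rough sketch of how these options combine, the helper below wraps DataFrame.show(n, truncate, vertical). The names show_readable and max_width are made up for illustration and are not part of PySpark; the only assumption is that spark_df is a PySpark DataFrame such as the df above.

```python
def show_readable(spark_df, n=5, max_width=None):
    """Print the first n rows with one field per line, so wide rows do not wrap.

    max_width is a hypothetical parameter name used for this sketch:
    None prints full cell values, an int greater than 1 cuts long strings
    to that many characters.
    """
    # show() accepts truncate=False (no truncation) or an int (max string length).
    truncate = False if max_width is None else max_width
    # vertical=True prints each row as a block of "column | value" lines.
    spark_df.show(n, truncate=truncate, vertical=True)
```

For example, show_readable(df, 2) prints two rows in full, and show_readable(df, 2, max_width=20) caps each value at 20 characters.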
Adrian Mole
IknewIt

Please try converting the Spark DataFrame to a pandas DataFrame with toPandas() and printing that instead:

from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.types import * 
from IPython.display import display 
import pandas as pd 
import gzip

schema = StructType([StructField("crimeid", StringType(), True), 
                     StructField("Month", StringType(), True), 
                     StructField("Reported_by", StringType(), True),
                     StructField("Falls_within", StringType(), True), 
                     StructField("Longitude", FloatType(), True), 
                     StructField("Latitue", FloatType(), True), 
                     StructField("Location", StringType(), True),
                     StructField("LSOA_code", StringType(), True),
                     StructField("LSOA_name", StringType(), True),
                     StructField("Crime_type", StringType(), True),
                     StructField("Outcome_type", StringType(), True),
                    ])

df = spark.read.csv("crimes.gz",header=False,schema=schema)
df.printSchema()

pandasDF = df.toPandas()  # convert the PySpark DataFrame to a pandas DataFrame (collects all rows to the driver)
print(pandasDF.head())  # print the first 5 rows
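
Because toPandas() pulls every row onto the driver, a large file may not fit in memory. A possible variation, sketched below, is to limit the rows in Spark first and convert only that small preview. The helper name preview_as_pandas is hypothetical, and it assumes spark_df is a PySpark DataFrame and that pandas is installed.

```python
def preview_as_pandas(spark_df, n=5):
    """Return the first n rows of a Spark DataFrame as a pandas DataFrame.

    preview_as_pandas is a name invented for this sketch, not a PySpark API.
    """
    # limit() is applied by Spark, so only n rows are collected to the driver
    # before toPandas() builds the pandas DataFrame.
    return spark_df.limit(n).toPandas()
```

In a notebook, display(preview_as_pandas(df)) should render the preview as a table; in a plain console, print(preview_as_pandas(df).to_string()) prints all columns without eliding the middle ones.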
  • 2
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Jeremy Caney Jul 03 '22 at 00:46