
We can read an Avro file using the code below:

val df = spark.read.format("com.databricks.spark.avro").load(path)

Is it possible to read PDF files using Spark DataFrames?

Aparna CM
  • Possible duplicate of [How to read PDF files and xml files in Apache Spark scala?](https://stackoverflow.com/questions/42000832/how-to-read-pdf-files-and-xml-files-in-apache-spark-scala) – Shaido Oct 31 '18 at 05:23
  • Thank you. I want to know whether it is possible to read PDF files using Spark DataFrames. – Aparna CM Oct 31 '18 at 06:36
  • I think that currently you need to read the data as a binary file (RDD) and then convert it to a dataframe. See the relevant JIRA issue: https://issues.apache.org/jira/browse/SPARK-20528 – Shaido Oct 31 '18 at 06:47
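
The approach from the last comment (read the PDFs as binary files into an RDD, then convert to a DataFrame) could look roughly like the sketch below. It is not from the original thread: it assumes Apache PDFBox 2.x is on the classpath for text extraction, and the input path, object name, and column names are hypothetical.

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

object PdfToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pdf-to-dataframe")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical location; pass a real path or glob as the first argument.
    val path = args.headOption.getOrElse("/tmp/pdfs/*.pdf")

    // binaryFiles yields one (file path, PortableDataStream) pair per file.
    val df = spark.sparkContext
      .binaryFiles(path)
      .map { case (file, stream) =>
        val in = stream.open()
        try {
          // Assumption: PDFBox 2.x API (PDDocument.load on an InputStream).
          val doc = PDDocument.load(in)
          try {
            (file, new PDFTextStripper().getText(doc))
          } finally {
            doc.close()
          }
        } finally {
          in.close()
        }
      }
      .toDF("path", "text") // one row per PDF: file path and extracted text

    df.show(false)
    spark.stop()
  }
}
```

This produces one row per PDF with the raw extracted text. Turning that text into meaningful columns is still up to you, since a PDF carries no schema.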

1 Answer


You cannot read a PDF directly into a DataFrame, because Spark cannot interpret it as columns (basically, a PDF doesn't have a standard schema). If you want to get data out of a PDF, first convert it to CSV or Parquet. You can then read that file and create a DataFrame from it, since those formats have a defined schema.
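
As a minimal sketch of the second half of that workflow, assuming the PDF content has already been converted by some external tool, the converted files can be loaded with the built-in readers. The file paths and the object name here are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ReadConvertedFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-converted-files")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical paths to files produced by converting the PDF beforehand.
    val csvPath = "/tmp/converted/report.csv"
    val parquetPath = "/tmp/converted/report.parquet"

    // CSV: take column names from the header row and let Spark infer types.
    val csvDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(csvPath)

    // Parquet stores its schema in the file, so no options are needed.
    val parquetDf = spark.read.parquet(parquetPath)

    csvDf.printSchema()
    parquetDf.printSchema()
    spark.stop()
  }
}
```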

Visit this GitBook to learn more about the read formats you can use to load data as a DataFrame:

DataFrameReader — Loading Data From External Data Sources

Pang
Sundeep Pidugu