Read pdf file in apache spark dataframes

Question

We can read an Avro file using the code below:

val df = spark.read.format("com.databricks.spark.avro").load(path)

Is it possible to read PDF files using Spark DataFrames?

Possible duplicate of [How to read PDF files and xml files in Apache Spark scala?](https://stackoverflow.com/questions/42000832/how-to-read-pdf-files-and-xml-files-in-apache-spark-scala) — Shaido, Oct 31 '18 at 05:23
Thank you. I want to know whether it is possible to read PDF files using Spark DataFrames — Aparna CM, Oct 31 '18 at 06:36
I think that currently you need to read the data as a binary file (RDD) and then convert it to a dataframe. See the relevant JIRA issue: https://issues.apache.org/jira/browse/SPARK-20528 — Shaido, Oct 31 '18 at 06:47

score 0 · Accepted Answer · edited Nov 02 '18 at 08:53

You cannot read a PDF directly into a DataFrame, because Spark cannot interpret its content as columns (basically, a PDF doesn't have a standard schema). If you want to get some data out of a PDF, first convert it to CSV or Parquet; you can then read that file and create a DataFrame, since it has a defined schema.
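As a minimal sketch of the second half of that workflow, assuming the PDF has already been exported to CSV by some external tool: the object name `ReadConvertedCsv`, the helper `readCsv`, and the sample columns are hypothetical, and the header option assumes the exported file has a header row.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadConvertedCsv {
  // Load a CSV file that was produced from a PDF by an external conversion step.
  def readCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")      // assumption: the first row holds column names
      .option("inferSchema", "true") // let Spark guess the column types
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("read-converted-csv")
      .getOrCreate()

    // Stand-in for the converted file, so the sketch runs on its own.
    val tmp = Files.createTempFile("converted", ".csv")
    Files.write(tmp, "id,name\n1,alpha\n2,beta\n".getBytes("UTF-8"))

    val df = readCsv(spark, tmp.toString)
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```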

Visit this GitBook to learn more about the read formats you can use to load data as a DataFrame:

DataFrameReader — Loading Data From External Data Sources
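If only the text of each PDF is needed, the approach suggested in the comments under the question (read the files as binary, then convert to a DataFrame) could be sketched roughly as follows. This is not a built-in Spark PDF reader: it assumes Apache PDFBox 2.x is on the classpath for text extraction, and the names `PdfToDataFrame`, `extractText`, and `readPdfs` are made up for illustration.

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.{DataFrame, SparkSession}

object PdfToDataFrame {
  // Extract the plain text of one PDF from its raw bytes (PDFBox 2.x API).
  def extractText(bytes: Array[Byte]): String = {
    val doc = PDDocument.load(bytes)
    try new PDFTextStripper().getText(doc)
    finally doc.close()
  }

  // Read every file matched by `path` as binary and build a DataFrame
  // with one row per file: (path, extracted text).
  def readPdfs(spark: SparkSession, path: String): DataFrame = {
    import spark.implicits._
    spark.sparkContext
      .binaryFiles(path)
      .map { case (file, stream) => (file, extractText(stream.toArray())) }
      .toDF("path", "text")
  }
}
```

With a hypothetical location such as `/data/pdfs/*.pdf`, calling `PdfToDataFrame.readPdfs(spark, "/data/pdfs/*.pdf")` would give two string columns, `path` and `text`. Any tabular structure inside the PDFs still has to be parsed out of the `text` column separately, which is the schema problem the answer describes.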
