
I am trying to write a Spark DataFrame to a SQLite database in Python using sqlite-jdbc from xerial and this example. I am getting the error

java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such table: test)

The database file hello.db is actually created with a table test which has the schema

sqlite> .schema test
CREATE TABLE test (age BIGINT , name TEXT );

I am running it with spark-submit --jars ../extras/sqlite-jdbc-3.8.11.2.jar example.py in order for it to find the driver.

I am running Spark 1.6.0.

(Hopefully) reproducible example

import os
os.environ["SPARK_HOME"] = "/usr/lib/spark"
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

config = {
            "spark.cores.max": "5",
            "spark.master" : "spark://master2:7077",
            "spark.python.profile": "false",
            "spark.ui.enabled": "false",
            "spark.executor.extraClassPath": "../extras/sqlite-jdbc-3.8.11.2.jar",
            "spark.driver.extraClassPath": "../extras/sqlite-jdbc-3.8.11.2.jar",
            "spark.jars": "../extras/sqlite-jdbc-3.8.11.2.jar"
        }

conf = SparkConf()
for key, value in config.items():
    conf = conf.set(key, value)

sc = SparkContext(appName="test", conf=conf)
sqlcontext = SQLContext(sc)

d = [{'name': 'Alice', 'age': 31}]
df = sqlcontext.createDataFrame(d)

url = "jdbc:sqlite:hello.db"
df.write.jdbc(url=url, table="test", mode="overwrite", properties={"driver":"org.sqlite.JDBC"})
bytesinflight
  • I am confused how this is supposed to work. Every executor performs writes separately, so using a file-based database doesn't make much sense. – zero323 Sep 22 '16 at 16:02
  • I am confused as well :). I see what you mean, and I guess that means the driver creates the table, which the executors can't see ... It makes more sense with an external MySQL server then? Is there a workaround for this case, or do I have to collect() and do it manually? – bytesinflight Sep 22 '16 at 17:08
  • If you want this for anything other than simple experiments, you'll need at least a proper database server (in general, a sink that can be accessed in parallel from each executor). – zero323 Sep 22 '16 at 17:12

1 Answer


In general each Spark executor performs reads and writes separately, so the data source and sink have to be accessible from every worker node. That makes SQLite rather useless in this scenario (it is great for local lookups, though).
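
As asked in the comments, for results small enough to fit on the driver, one possible workaround is to collect() the rows and write them locally with Python's built-in sqlite3 module. This is just a sketch that reuses the hello.db file and the test schema from the question:

import sqlite3

# Collect the (small) DataFrame to the driver -- only viable when the
# result comfortably fits in driver memory.
rows = df.collect()

conn = sqlite3.connect("hello.db")
conn.execute("CREATE TABLE IF NOT EXISTS test (age INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO test (age, name) VALUES (?, ?)",
    [(row.age, row.name) for row in rows],
)
conn.commit()
conn.close()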

If you want to store the output in a database in non-local mode, you'll need a proper database server.
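
For example, writing to a PostgreSQL server looks almost the same as the SQLite attempt; the hostname, database name, and credentials below are placeholders, and the PostgreSQL JDBC driver jar would be passed via --jars just like the SQLite jar:

# Placeholder connection details for a server-based database that every
# executor can reach over the network.
url = "jdbc:postgresql://dbhost:5432/mydb"
properties = {
    "user": "spark_user",        # placeholder credentials
    "password": "secret",
    "driver": "org.postgresql.Driver",
}

# Each executor writes its own partitions through its own connection,
# which is exactly what a file-based SQLite database cannot support.
df.write.jdbc(url=url, table="test", mode="overwrite", properties=properties)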

zero323