Spark DataFrameWriterV2 example using SQLite (Scala)
This article walks through an example of using the DataFrameWriterV2 API introduced in Spark 3.0.0.
DataFrameWriterV2 is an improvement over the existing DataFrameWriter that optimises writes to external databases.
For simplicity, I am going to use this new API to create a new SQLite database and store a small DataFrame in it.
Note: Although this example uses Scala, it’s trivial to achieve the same with Python by following the same principles described here.
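To make that contrast concrete, here is a minimal sketch of the same write expressed with both APIs (the catalog table sqllite.tab1 is set up later in this article):
// v1 (DataFrameWriter): the behavior is selected by a save-mode string
df.write.format("jdbc").option("url", "jdbc:sqlite:sample.db").option("dbtable", "tab1").mode("errorifexists").save()
// v2 (DataFrameWriterV2): the intent is an explicit verb on a catalog table
df.writeTo("sqllite.tab1").create()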
Requirements
- Java (I am using the classic v8 here)
- sbt - to build and run the project
Set up
Let’s start by creating a Scala project using sbt and providing a project name.
$ sbt new scala/scala-seed.g8
...
name [Scala Seed Project]: dataframewriter_v2
Navigate to the directory and remove the sample file:
$ cd dataframewriter_v2
# Remove unnecessary sample file
$ rm src/main/scala/example/Hello.scala
Add a build.sbt
file with the content below. It defines the Spark and SQLite
Java libraries to download for building and running the project.
import Dependencies._

ThisBuild / scalaVersion     := "2.13.8"
ThisBuild / version          := "0.1.0-SNAPSHOT"
ThisBuild / organization     := "com.example"
ThisBuild / organizationName := "example"

lazy val root = (project in file("."))
  .settings(
    name := "dataframewriter_v2",
    libraryDependencies += scalaTest % Test
  )

// See https://www.scala-sbt.org/1.x/docs/Using-Sonatype.html for instructions on how to publish to Sonatype.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.3.0",
  "org.apache.spark" %% "spark-sql"  % "3.3.0",
  "org.apache.spark" %% "spark-hive" % "3.3.0"
)
libraryDependencies += "org.json4s" % "json4s-jackson_2.13" % "3.7.0-M11"
// https://mvnrepository.com/artifact/org.json4s/json4s-core
// https://mvnrepository.com/artifact/org.xerial/sqlite-jdbc
libraryDependencies += "org.xerial" % "sqlite-jdbc" % "3.39.3.0"
Add the example file:
$ touch src/main/scala/example/DataFrameWriterV2Test.scala
and paste the following into it:
package example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog

object DataFrameWriterV2Test extends App {
  // JDBC URL pointing at a SQLite database file in the current directory
  val url = "jdbc:sqlite:sample.db"

  val spark = SparkSession.builder.master("local[*]").appName("V2 app").getOrCreate()

  // Register a catalog named "sqllite", backed by the generic JDBC catalog implementation
  spark.conf.set("spark.sql.catalog.sqllite", classOf[JDBCTableCatalog].getName)
  spark.conf.set("spark.sql.catalog.sqllite.url", url)

  // A tiny DataFrame holding the ids 0 and 1
  val df = spark.range(2)

  // DataFrameWriterV2: create the table tab1 in the sqllite catalog and write df into it
  df.writeTo("sqllite.tab1").create()

  spark.sql("select * from sqllite.tab1").show()
}
How the DataFrameWriterV2 part works
The code defines a new catalog called sqllite and sets its handler to JDBCTableCatalog, a generic class that acts as an interface to any JDBC-backed data source. We also define a JDBC URL that points to a sample.db file in the current directory. These settings are needed so that we can refer to the V2 target table through the catalog.
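The same catalog settings can also be supplied when the session is built, instead of being set afterwards. A minimal sketch, using the same configuration keys as above:
val spark = SparkSession.builder
  .master("local[*]")
  .appName("V2 app")
  .config("spark.sql.catalog.sqllite", "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
  .config("spark.sql.catalog.sqllite.url", "jdbc:sqlite:sample.db")
  .getOrCreate()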
Although it’s not obvious where DataFrameWriterV2 starts working, it kicks in when we call df.writeTo: if you check the type of that expression, it is a DataFrameWriterV2 (as opposed to the usual DataFrameWriter):
scala> df.writeTo("sqllite.tab1")
val res4: org.apache.spark.sql.DataFrameWriterV2[Long] = org.apache.spark.sql.DataFrameWriterV2@df3a499
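Beyond create(), the writer returned by writeTo exposes explicit verbs in place of the old save-mode strings. A brief sketch of the main ones (whether each verb succeeds depends on the capabilities of the underlying catalog):
df.writeTo("sqllite.tab1").create()          // create the table; fails if it already exists
df.writeTo("sqllite.tab1").append()          // append rows to an existing table
df.writeTo("sqllite.tab1").replace()         // replace an existing table; fails if it is absent
df.writeTo("sqllite.tab1").createOrReplace() // create the table, replacing it if it exists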
Execute
While inside the dataframewriter_v2 directory, start sbt and compile the project. This loads build.sbt, resolves and downloads the dependencies, and compiles the sources.
$ sbt
...
sbt:dataframewriter_v2> reload
sbt:dataframewriter_v2> compile
Now, we are ready to run:
sbt:dataframewriter_v2> run
...
...
+---+
| id|
+---+
| 0|
| 1|
+---+
It creates a new SQLite database called sample.db in the current directory, creates a new table called tab1 within it, and then runs a SELECT against that table.
Check:
$ ls -l
total 28
-rw-rw-r-- 1 khalid khalid 878 Sep 25 17:58 build.sbt
drwxrwxr-x 3 khalid khalid 4096 Sep 25 17:43 project
-rw-rw-r-- 1 khalid khalid 8192 Sep 25 17:59 sample.db
drwxr-xr-x 2 khalid khalid 4096 Sep 25 17:59 spark-warehouse
drwxrwxr-x 4 khalid khalid 4096 Sep 25 17:41 src
drwxrwxr-x 7 khalid khalid 4096 Sep 25 17:59 target
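If you have the sqlite3 command-line tool installed (not required by the project, just a convenient check), you can query the new table directly:
$ sqlite3 sample.db "select * from tab1;"
0
1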
Summary
This simple example shows how to write data to a JDBC-backed database using DataFrameWriterV2. You can adapt it easily to your database of choice by adding that database's JDBC driver jar to the dependencies and providing a JDBC URL with all the required secrets and options.
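As a hypothetical illustration of that swap, pointing the same pattern at PostgreSQL would only change the driver dependency and the catalog settings. The catalog name pg, URL, and credentials below are placeholders:
// build.sbt: add the driver for your database in place of sqlite-jdbc
// libraryDependencies += "org.postgresql" % "postgresql" % "42.5.0"

// Application code: same JDBCTableCatalog, different URL and options (placeholders)
spark.conf.set("spark.sql.catalog.pg", classOf[JDBCTableCatalog].getName)
spark.conf.set("spark.sql.catalog.pg.url", "jdbc:postgresql://localhost:5432/mydb")
spark.conf.set("spark.sql.catalog.pg.user", "myuser")         // credential options are
spark.conf.set("spark.sql.catalog.pg.password", "mypassword") // forwarded to the JDBC driver

df.writeTo("pg.tab1").create()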