Spark and PostgreSQL
How to connect Apache Spark to PostgreSQL.
pyspark notebook and docker-compose
For this I will use the jupyter/pyspark-notebook image together with a Postgres container in docker-compose. First I extend the notebook Dockerfile to include the PostgreSQL JDBC jar; wget drops it into /home/jovyan, the image's working directory, which is where the notebook code below picks it up:
Dockerfile
FROM jupyter/pyspark-notebook
RUN wget https://jdbc.postgresql.org/download/postgresql-42.2.17.jar
ENV SPARK_OPTS="${SPARK_OPTS} --packages org.postgresql:postgresql:42.2.17"
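To sanity-check the image before wiring it into compose (the pyspark-postgres tag here is just a placeholder):

docker build -t pyspark-postgres .
# The jar should sit in /home/jovyan, where the notebook expects it
docker run --rm pyspark-postgres ls -l /home/jovyan/postgresql-42.2.17.jar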
The docker-compose file is then:
version: '3'
services:
  db:
    image: 'postgres'
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_DB=postgres
    volumes:
      - database-data:/var/lib/postgresql/data/
      - ./data/:/data
    ports:
      - "5432:5432"
  spark:
    build: .
    volumes:
      - ./data/:/data
      - ./notebooks/:/home/jovyan/notebook
    environment:
      - SPARK_OPTS=--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info --conf=spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf=spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --packages=io.delta:delta-core_2.12:0.7.0,org.postgresql:postgresql:42.2.17
    ports:
      - "8888:8888"
volumes:
  database-data:
  data:
  notebooks:
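With both files in place the stack comes up with docker-compose. The timing table queried below is never created in this post, so here is one way to seed a matching table; its schema is an assumption inferred from the sample output:

docker-compose up --build -d
# Seed a table matching the read below; the (id, updated) schema is
# guessed from the sample output
docker-compose exec db psql -U postgres -d postgres -c \
  "CREATE TABLE timing (id serial PRIMARY KEY, updated timestamp DEFAULT now());
   INSERT INTO timing DEFAULT VALUES;
   INSERT INTO timing DEFAULT VALUES;
   INSERT INTO timing DEFAULT VALUES;"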
jupyter notebook
We can now connect to Postgres through PySpark:
from pyspark.sql import SparkSession

# The driver jar downloaded in the Dockerfile
jdbc = '/home/jovyan/postgresql-42.2.17.jar'

spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", jdbc) \
    .getOrCreate()

# 'db' is the Postgres service name from docker-compose,
# resolvable on the compose network
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://db:5432/postgres") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "timing") \
    .option("user", "postgres") \
    .option("password", "postgres") \
    .load()
df.show()
+---+--------------------+
| id| updated|
+---+--------------------+
| 1|2020-10-15 05:53:...|
| 2|2020-10-15 05:53:...|
| 3|2020-10-15 05:53:...|
+---+--------------------+
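Writing back works symmetrically through the same JDBC options. A minimal sketch, where timing_copy is a made-up target table:

# Write the DataFrame back to Postgres; mode("overwrite") replaces
# the target table if it already exists
df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://db:5432/postgres") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "timing_copy") \
    .option("user", "postgres") \
    .option("password", "postgres") \
    .mode("overwrite") \
    .save()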