Spark and Kubernetes, A Match Made in Heaven

Feb 28, 2021

Spark on Kubernetes

What is Apache Spark?

Apache Spark is an open-source, distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size.

It is faster than earlier approaches to working with big data, such as classical MapReduce. The main reason is that Spark keeps intermediate data in memory (RAM) wherever it can, instead of writing it to disk between processing steps, and memory access is much faster than disk access.

It can be used for many things, such as running distributed SQL, creating data pipelines, ingesting data into a database, running machine learning algorithms, and working with graphs or data streams.

Spark needs a cluster manager (scheduler) to distribute its workload across the machines in a cluster.

The main cluster managers for Apache Spark are:

Standalone
Apache Mesos
Hadoop YARN
Kubernetes
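Each of these cluster managers is selected through the master URL passed to Spark. The sketch below is only an illustration of that mapping, based on the URL schemes described in the Spark documentation; the helper name `cluster_manager_for` and all host names are made up for this example.

```python
def cluster_manager_for(master_url: str) -> str:
    """Return the cluster manager implied by a Spark master URL.

    Hypothetical helper for illustration; Spark does this parsing internally.
    """
    if master_url.startswith("spark://"):
        return "Standalone"
    if master_url.startswith("mesos://"):
        return "Apache Mesos"
    if master_url == "yarn":
        return "Hadoop YARN"
    if master_url.startswith("k8s://"):
        return "Kubernetes"
    if master_url.startswith("local"):
        return "Local mode (no cluster manager)"
    raise ValueError(f"Unrecognized master URL: {master_url!r}")


if __name__ == "__main__":
    # Example hosts are placeholders, not real endpoints.
    examples = [
        "spark://spark-master.example.com:7077",
        "mesos://mesos-master.example.com:5050",
        "yarn",
        "k8s://https://k8s-api.example.com:6443",
    ]
    for url in examples:
        print(f"{url} -> {cluster_manager_for(url)}")
```

Note that the Kubernetes form wraps the address of the Kubernetes API server, which is why it looks like a URL inside a URL.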

What is Kubernetes?

Kubernetes is an open-source, production-ready container orchestration platform used for the deployment, scaling, and management of large-scale applications.

The primary advantage of using Kubernetes in your environment, especially if you are optimizing application development for the cloud, is that it gives you a platform to schedule and run containers on clusters of physical or virtual machines (VMs).

Why Run Spark on Kubernetes?

Both Spark and Kubernetes are built to let users run their applications at large scale without slowing them down.

So it makes perfect sense to use Kubernetes and Spark together. The main benefits are:

High Isolation: Each application runs inside its own containers, which prevents it from interfering with other applications.
Dependency Management: The container image holds all of an application's dependencies, which makes them easier to manage than on Hadoop, where updating the shared environment is difficult.
Same Performance: The performance difference compared with YARN is almost negligible, so the other benefits come without a meaningful slowdown.
Cost Efficient: Costs are comparatively low because Kubernetes allows better resource sharing while keeping workloads isolated.
Cloud Agnostic: Kubernetes runs on Amazon Web Services (AWS), Microsoft Azure, and the Google Cloud Platform (GCP), and you can also run it on-premises. You can move workloads without having to redesign your applications or completely rethink your infrastructure, which lets you standardize on one platform and avoid vendor lock-in.

Spark on Kubernetes:

https://spark.apache.org/docs/latest/running-on-kubernetes.html

spark-submit can be directly used to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows:

Spark creates a Spark driver running within a Kubernetes pod.
The driver creates executors, which also run within Kubernetes pods, connects to them, and executes the application code.
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in a “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
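To make the submission step more concrete, here is a minimal sketch that assembles a spark-submit command line for a Kubernetes cluster. The flags and configuration keys (`--master k8s://...`, `--deploy-mode cluster`, `spark.executor.instances`, `spark.kubernetes.container.image`) follow the Spark documentation linked above. The helper `build_spark_submit_args`, the API server address, the image name, and the jar path are placeholders invented for this example, so replace them with your own values.

```python
import shlex
from typing import Dict, List, Optional


def build_spark_submit_args(
    api_server: str,
    image: str,
    app_resource: str,
    main_class: Optional[str] = None,
    name: str = "spark-app",
    executor_instances: int = 2,
    extra_conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble spark-submit arguments for cluster mode on Kubernetes.

    Hypothetical helper: it only builds the argument list, it does not run it.
    """
    args = [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # address of the Kubernetes API server
        "--deploy-mode", "cluster",         # the driver itself runs in a pod
        "--name", name,
    ]
    if main_class:
        args += ["--class", main_class]

    conf = {
        "spark.executor.instances": str(executor_instances),
        "spark.kubernetes.container.image": image,
    }
    conf.update(extra_conf or {})
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]

    # The application jar or script comes last; local:// means the file
    # is already present inside the container image.
    args.append(app_resource)
    return args


if __name__ == "__main__":
    args = build_spark_submit_args(
        api_server="https://k8s-api.example.com:6443",          # placeholder
        image="registry.example.com/spark:latest",              # placeholder
        app_resource="local:///path/to/examples.jar",           # placeholder
        main_class="org.apache.spark.examples.SparkPi",
        name="spark-pi",
        executor_instances=5,
    )
    print(" ".join(shlex.quote(a) for a in args))
```

The printed command can be pasted into a shell on a machine that has a Spark distribution and access to the cluster. Building the arguments as a list first keeps quoting problems out of the picture if you later pass them to `subprocess.run`.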

There are two ways to submit Spark applications to Kubernetes:

1. Using the Spark Operator, an open-source project developed by GCP.

We can use this to run our Spark applications as Kubernetes objects with a simple-to-use configuration.

2. Using the spark-submit command directly, which interacts with the Kubernetes API, which in turn handles the execution.
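For the operator route, the application is described declaratively as a SparkApplication object. The sketch below builds such an object as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The field names follow the operator's v1beta2 API as I understand it, so check them against the operator version you install. The helper name, image, file path, version, namespace, and service account are all placeholders.

```python
import json
from typing import Any, Dict


def spark_application_manifest(
    name: str,
    image: str,
    main_class: str,
    app_file: str,
    spark_version: str,
    executor_instances: int = 2,
    namespace: str = "default",
    service_account: str = "spark",
) -> Dict[str, Any]:
    """Build a SparkApplication manifest for the Spark Operator.

    Hypothetical helper; field names assume the sparkoperator.k8s.io/v1beta2 API.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": app_file,
            "sparkVersion": spark_version,
            "restartPolicy": {"type": "Never"},
            # The driver needs a service account allowed to create executor pods.
            "driver": {
                "cores": 1,
                "memory": "512m",
                "serviceAccount": service_account,
            },
            "executor": {
                "cores": 1,
                "instances": executor_instances,
                "memory": "512m",
            },
        },
    }


if __name__ == "__main__":
    manifest = spark_application_manifest(
        name="spark-pi",
        image="registry.example.com/spark:latest",      # placeholder
        main_class="org.apache.spark.examples.SparkPi",
        app_file="local:///path/to/examples.jar",       # placeholder
        spark_version="3.1.1",                          # placeholder
        executor_instances=2,
    )
    print(json.dumps(manifest, indent=2))
```

Saving the output to a file and running `kubectl apply -f` on it would hand the job to the operator, which then performs the submission on your behalf.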

Spark's native Kubernetes scheduler (available since Spark 2.3) is currently experimental and is planned to become generally available in the next release.

Thinking of moving to the Cloud?

Let us help you on your cloud journey. Reach out to us at Cluephant.

Achieve FASTER, BETTER results with Cluephant (Cloud Natives).