Introduction
Take your Spark Big Data workloads to the next level by leveraging the power and flexibility of Kubernetes.
In this article, we will look at why it is worth running Spark on Kubernetes and how to do it.
For the demo, we will use a Minikube cluster to run a basic Spark-Pi job.
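As a preview of what that looks like, here is a minimal sketch of the spark-submit invocation for the Spark-Pi example against a Minikube cluster. The container image name, the service account, and the Spark version in the jar path are placeholders to adapt to your own setup.

```
# Point spark-submit at the Kubernetes API server exposed by Minikube
# (the exact URL and port can be checked with `kubectl cluster-info`).
# <your-spark-image> is a placeholder for an image you have built and pushed.
spark-submit \
  --master k8s://https://$(minikube ip):8443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar
```

In cluster deploy mode the driver itself runs in a pod, so it needs a service account with permission to create executor pods, hence the serviceAccountName setting.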
Why use Spark on Kubernetes?
Apache Spark is a framework for fast processing of very large data sets. Kubernetes is a portable, extensible, open-source platform for orchestrating the execution of containerized workloads and services across a cluster of machines.
Running Spark on Kubernetes is a winning combo for the following reasons:
Containerization and Dependency Management
- Managing dependencies in Hadoop is complex. Packages have to be copied to every node in the cluster, so having two versions of the same software coexist on one node, or updating environments, can become challenging.
- The main motivation for using Kubernetes itself resides in the power and flexibility offered by containerization technology (see the sketch after this list). With Docker containers, your application is more portable…
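As a concrete illustration of this point, the Spark distribution ships a helper script, bin/docker-image-tool.sh, that builds a container image bundling Spark together with its runtime dependencies. The registry name and tag below are placeholders.

```
# From the root of an unpacked Spark distribution: build a Spark container
# image and push it to a registry (registry name and tag are placeholders).
./bin/docker-image-tool.sh -r <your-registry>/spark -t v3.3.0 build
./bin/docker-image-tool.sh -r <your-registry>/spark -t v3.3.0 push
```

Every driver and executor pod then starts from the exact same image, which avoids the node-by-node package drift described above.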