Introduction
Take your Spark Big Data workloads to the next level by leveraging the power and flexibility of Kubernetes.
In this article, we will look at why it is worth running Spark on Kubernetes and how to do it.
For the demo, we will use a Minikube cluster to run a basic Spark-Pi job.
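As a preview of what that looks like, here is a minimal sketch of the spark-submit invocation for the Spark-Pi example against a Minikube cluster. The container image name, the service account, and the Spark version in the jar path are placeholders to adapt to your own setup.

```
# Point spark-submit at the Kubernetes API server exposed by Minikube
# (the exact URL and port can be checked with `kubectl cluster-info`).
# <your-spark-image> is a placeholder for an image you have built and pushed.
spark-submit \
  --master k8s://https://$(minikube ip):8443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar
```

In cluster deploy mode the driver itself runs in a pod, so it needs a service account with permission to create executor pods, hence the serviceAccountName setting.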
Why use Spark on Kubernetes?
Apache Spark is a framework for fast processing of very large data sets. Kubernetes is a portable, extensible, open-source platform for orchestrating the execution of containerized workloads and services across a cluster of machines.
Running Spark on Kubernetes is a winning combo for the following reasons:
Containerization and Dependency Management
- Managing dependencies in Hadoop is complex. Packages have to be copied to every node in the cluster, so having two versions of the same software coexist on one node, or updating environments, can become challenging.
- The main motivation for using Kubernetes itself resides in the power and flexibility offered by containerization technology (see the sketch after this list). With Docker containers, your application is more portable…
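As a concrete illustration of this point, the Spark distribution ships a helper script, bin/docker-image-tool.sh, that builds a container image bundling Spark together with its runtime dependencies. The registry name and tag below are placeholders.

```
# From the root of an unpacked Spark distribution: build a Spark container
# image and push it to a registry (registry name and tag are placeholders).
./bin/docker-image-tool.sh -r <your-registry>/spark -t v3.3.0 build
./bin/docker-image-tool.sh -r <your-registry>/spark -t v3.3.0 push
```

Every driver and executor pod then starts from the exact same image, which avoids the node-by-node package drift described above.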