About 81% of enterprises employ a multi-cloud strategy, with the typical organization using four to five different clouds. In a multi-cloud environment, cloud neutrality and app portability are critical: without them, teams waste time debugging, diagnosing problems, and redeploying for each environment. Containers have proven to be the simplest and most effective solution to this problem. They can be run anywhere, provide a consistent, predictable runtime environment, and use far fewer resources than virtual machines. That’s why Google has been using them in-house for years, along the way developing Kubernetes, a container orchestration system. Now open source and the de facto standard for managing containerized environments, Kubernetes can be deployed in any cloud environment, whether private, public, or hybrid.
Containers and Kubernetes are a natural fit for Apache Spark, an open source parallel processing framework used for large-scale data analytics and machine learning workloads.
Why use Apache Spark with Kubernetes?
Apache Spark requires a distributed storage system and a cluster manager, traditionally Spark's own standalone mode, Mesos, or YARN. However, as cloud computing, and containers in particular, grew in popularity, organizations began asking for a way to run their data processing and machine learning workloads in Kubernetes clusters so that they could streamline those workflows.
Kubernetes allows users the choice of continuing to use Hadoop as a data source while eliminating the need for Mesos or YARN. Users also enjoy many additional benefits, including:
- True portability across any cloud. Kubernetes was designed to run anywhere, and it greatly simplifies lift and shifts by automating many of the operational tasks associated with app deployment.
- Persistent volumes that decouple apps from the underlying storage system. This is particularly important for Spark and other big data apps, which need a lot of storage space and consistent resources.
- Consistent, reliable dependency packaging. Spark apps are heterogeneous and draw on many third-party libraries written in Java, Scala, and Python. Kubernetes allows all dependencies to be placed into a single container image, so that Spark developers can determine dependencies once instead of dealing with constant changes.
- Automatic scaling of services based on utilization so that each team is using only the resources they need, when they need them.
Native Kubernetes integration with Spark applications became available beginning with Spark 2.3.0. Google’s new Kubernetes Operator for Apache Spark, also known as the Spark Operator, utilizes this native Kubernetes integration to run, monitor, and manage the lifecycle of Spark applications within a Kubernetes cluster on Google Cloud Platform (GCP). Google notes that the Spark Operator, available now in beta, is a Kubernetes custom controller that uses custom resources for declarative specification of Spark applications. It allows for fine-grained lifecycle management of Spark applications, including support for automatic restart using a configurable restart policy, the ability to run cron-based, scheduled applications, and improved elasticity and integration with Kubernetes services such as logging and monitoring.
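For comparison, the native integration that the operator builds on is driven directly through spark-submit, with the Kubernetes API server as the master URL and a single container image carrying the application's dependencies. The Python sketch below only assembles such a command line; the function name, host, registry, and jar path are hypothetical, while the flags and the two configuration keys are the ones Spark documents for running on Kubernetes.

```python
from typing import Dict, List, Optional


def build_spark_submit_args(
    api_server: str,
    image: str,
    app_name: str,
    main_class: str,
    app_resource: str,
    executor_instances: int = 2,
    extra_conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the argv for a cluster-mode spark-submit against Kubernetes."""
    # Spark picks its Kubernetes scheduler backend from the k8s:// prefix.
    if api_server.startswith("k8s://"):
        master = api_server
    else:
        master = "k8s://" + api_server

    # One image holds Spark plus every application dependency.
    conf = {
        "spark.executor.instances": str(executor_instances),
        "spark.kubernetes.container.image": image,
    }
    conf.update(extra_conf or {})

    args = [
        "spark-submit",
        "--master", master,
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--class", main_class,
    ]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    # The application jar or script always comes last.
    args.append(app_resource)
    return args


if __name__ == "__main__":
    # Hypothetical cluster endpoint, registry, and jar location.
    argv = build_spark_submit_args(
        api_server="https://kubernetes.example.com:6443",
        image="registry.example.com/spark:2.4.0",
        app_name="spark-pi",
        main_class="org.apache.spark.examples.SparkPi",
        app_resource="local:///opt/spark/examples/jars/spark-examples.jar",
    )
    print(" ".join(argv))
```

Submitting this way leaves restarts, scheduling, and monitoring to whoever runs the command, which is the gap the operator is meant to fill.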
Because the Spark Operator lets users describe Spark applications in declarative specifications and manage them with native Kubernetes tooling such as kubectl, users get a common, simplified control plane for managing different kinds of workloads on Kubernetes, along with improved resource utilization. The Spark Operator also supports Spark 2.4, which includes major enhancements to native Kubernetes integration, such as support for PySpark and SparkR applications, client mode support for interactive applications and data science notebooks such as Jupyter, and support for mounting certain types of Kubernetes volumes.
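To give a sense of what such a declarative specification might look like, the Python sketch below builds a SparkApplication custom resource as a dictionary and prints it as JSON, which kubectl accepts alongside YAML. The field names follow the operator's published examples but should be checked against the CRD version actually installed; the API version string, the `spark` service account, the resource sizes, and all example values are assumptions.

```python
import json
from typing import Any, Dict


def spark_application_manifest(
    name: str,
    image: str,
    main_class: str,
    main_application_file: str,
    spark_version: str = "2.4.0",
    executor_instances: int = 2,
    namespace: str = "default",
    # Assumption: the API group/version depends on the operator release.
    api_version: str = "sparkoperator.k8s.io/v1beta2",
) -> Dict[str, Any]:
    """Build a SparkApplication custom resource as a plain dict."""
    return {
        "apiVersion": api_version,
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": main_application_file,
            "sparkVersion": spark_version,
            # Configurable restart policy: retry a failed run three times,
            # waiting 10 seconds between attempts.
            "restartPolicy": {
                "type": "OnFailure",
                "onFailureRetries": 3,
                "onFailureRetryInterval": 10,
            },
            # Assumed sizes and service account, for illustration only.
            "driver": {"cores": 1, "memory": "512m", "serviceAccount": "spark"},
            "executor": {
                "cores": 1,
                "instances": executor_instances,
                "memory": "512m",
            },
        },
    }


if __name__ == "__main__":
    manifest = spark_application_manifest(
        name="spark-pi",
        image="registry.example.com/spark:2.4.0",  # hypothetical image
        main_class="org.apache.spark.examples.SparkPi",
        main_application_file="local:///opt/spark/examples/jars/spark-examples.jar",
    )
    print(json.dumps(manifest, indent=2))
```

Saved to a file, the output could be submitted with `kubectl apply -f` and the resulting application inspected with `kubectl get sparkapplications`, assuming the operator and its custom resource definitions are already installed in the cluster.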
Google has big plans to enhance and improve the Spark Operator throughout 2019, including:
- Making the Spark Operator compatible with different versions of Spark. As it stands now, a Spark Operator that works with Spark 2.3.x will not work with Spark 2.4.
- Adding Kerberos authentication, starting with Spark 3.0.
- Enhancing the Spark Operator’s ability to run production batch processing workloads through priority queues and basic priority-based scheduling.
The Spark Operator is available for quick installation right now from the GCP Marketplace, where it will immediately and seamlessly integrate with other GCP services.