Distributed computing spreads the compute load across multiple nodes (computers) connected via a network. The networked machines collaborate toward a shared goal, dividing the work and pooling their resources to achieve it.
Early examples of distributed architecture include Napster, where multiple computers served the common goal of fulfilling a request to download a file. While there are various examples of distributed computing (the biggest being the internet itself), the need for distributed computing in modern machine learning applications is driven by a few key reasons:
- The need to process ever-growing volumes of data
- The limits of scaling up a single machine by adding more memory and processors
- The need to create fault-tolerant applications
Apache Spark
Presently, Apache Spark is the gold standard framework for distributed, in-memory, general-purpose cluster computing. It efficiently partitions (or shards) a dataset across the cluster so that data transformations and statistical models can run on many nodes in parallel.
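Spark's partition-then-aggregate model can be illustrated in miniature with Python's standard library. The sketch below is not Spark code — the function names and shard count are illustrative — but it loosely mirrors what a Spark job does: shard a dataset, run a transformation on each shard on a separate worker, then combine the partial results.

```python
from concurrent.futures import ProcessPoolExecutor


def partition(data, num_shards):
    """Split a dataset into roughly equal shards, as Spark does across executors."""
    return [data[i::num_shards] for i in range(num_shards)]


def transform(shard):
    """A per-shard transformation: sum of squares, standing in for any map step."""
    return sum(x * x for x in shard)


def run_job(data, num_shards=4):
    shards = partition(data, num_shards)
    # Each shard runs on its own worker process, like Spark tasks on executors.
    with ProcessPoolExecutor(max_workers=num_shards) as pool:
        partials = pool.map(transform, shards)
    # Reduce step: combine the partial results from all workers.
    return sum(partials)


if __name__ == "__main__":
    print(run_job(list(range(10))))  # sum of squares of 0..9 = 285
```

In real Spark, the same shape appears as an RDD or DataFrame transformation followed by an action; the framework additionally handles data locality, shuffling, and recovery from failed workers, which this toy version omits.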
There are various ways teams can leverage the power of Apache Spark and distributed computing. The most traditional solutions deploy Apache Spark on on-premises big data clusters — but that involves buying, managing, and maintaining expensive hardware.
Cloud technologies such as Azure have made it easier to use Spark and distributed computing by provisioning big data clusters in the cloud, removing the need to buy and maintain the hardware.
Introducing Databricks
Databricks is a service that removes most of the management burden that accompanies a Spark cluster. Furthermore, it enables streamlined collaboration, operationalization, and multi-language programming through its "notebook" interface.
Databricks doesn't necessarily remove the need for Spark expertise, but neither is it a new domain or technology to learn — it simply runs Spark under the hood, while providing more frequent feature updates, elasticity, and broader geographic coverage.
For a practical walkthrough of Databricks in action, see Processing Real-Time Streams in Databricks – Part 1.