Distributed computing spreads the compute load across multiple nodes (computers) connected via a network. The networked machines collaborate toward a shared goal, dividing the work and pooling their resources to achieve it.
Early examples of distributed architecture include Napster, where multiple computers served the common goal of fulfilling a request to download a file. While there are various examples of distributed computing (the biggest being the internet itself), the need for distributed computing in modern machine learning applications is driven by a few key reasons:
- The need to process ever-growing volumes of data
- The limits of scaling up a single machine by adding more memory and processors
- The need to create fault-tolerant applications
Apache Spark
Presently, Apache Spark is the gold standard framework for distributed, in-memory, general-purpose cluster computing. It efficiently partitions (or shards) a dataset across the cluster so that data transformations and statistical models can run on many nodes in parallel.
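Spark's partition-then-aggregate model can be illustrated in miniature with Python's standard library. The sketch below is not Spark code — the function names and shard count are illustrative — but it loosely mirrors what a Spark job does: shard a dataset, run a transformation on each shard on a separate worker, then combine the partial results.

```python
from concurrent.futures import ProcessPoolExecutor


def partition(data, num_shards):
    """Split a dataset into roughly equal shards, as Spark does across executors."""
    return [data[i::num_shards] for i in range(num_shards)]


def transform(shard):
    """A per-shard transformation: sum of squares, standing in for any map step."""
    return sum(x * x for x in shard)


def run_job(data, num_shards=4):
    shards = partition(data, num_shards)
    # Each shard runs on its own worker process, like Spark tasks on executors.
    with ProcessPoolExecutor(max_workers=num_shards) as pool:
        partials = pool.map(transform, shards)
    # Reduce step: combine the partial results from all workers.
    return sum(partials)


if __name__ == "__main__":
    print(run_job(list(range(10))))  # sum of squares of 0..9 = 285
```

In real Spark, the same shape appears as an RDD or DataFrame transformation followed by an action; the framework additionally handles data locality, shuffling, and recovery from failed workers, which this toy version omits.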
There are various ways teams can leverage the power of Apache Spark and distributed computing. The most traditional solutions deploy Apache Spark on on-premises big data clusters — but that involves buying, managing, and maintaining expensive hardware.
Cloud technologies such as Azure have made it easier to use Spark and distributed computing by provisioning big data clusters in the cloud, removing the need to buy and maintain the hardware.
Introducing Databricks
Databricks is a service that removes most of the management burden that accompanies a Spark cluster. Furthermore, it enables streamlined collaboration, operationalization, and multi-language programming through its "notebook" interface.
Databricks doesn't necessarily remove the need for Spark expertise, but neither is it a new domain or technology to learn — it simply runs Spark under the hood, while providing more frequent feature updates, elasticity, and broader geographic coverage.
For a practical walkthrough of Databricks in action, see Processing Real-Time Streams in Databricks – Part 1.