Most of the customers I talk to are directly or indirectly asking how to scale their workloads using Databricks. It has become the new normal for data processing in the cloud. If you are using or planning to use Azure Databricks, this post will guide you through some interesting things you can investigate as you start. I have not gone deep into any one technology here, but these are good things for developers and architects to know.
-
Use interactive clusters: Teams spend a lot of time playing with data and exploring patterns. There is certainly a need for unobstructed compute and horsepower while users are exploring the data, or while developers are working on notebooks. It’s great to have an interactive cluster available to developers or end users for such scenarios.
-
Use job clusters: While your jobs or notebooks are running in production, there are cost optimizations that can be achieved using job clusters. These clusters spin up for the duration of the job run only, provide the compute, and decommission automatically once the job is done. Whether you are scheduling within Azure Databricks or orchestrating from tools such as Azure Data Factory, job clusters provide a great way to optimize cost and resources in production.
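As a rough sketch, this is how a job with its own job cluster can be created through the Databricks Jobs API (2.0); the workspace URL, token, job name, notebook path, runtime version, and VM size below are all placeholders, not recommendations:

```python
# Minimal sketch: create a job that runs on a job cluster via the
# Databricks Jobs API 2.0. All identifiers below are placeholders.
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

job_spec = {
    "name": "nightly-etl",  # hypothetical job name
    # Specifying "new_cluster" (rather than "existing_cluster_id") is what
    # makes this a job cluster: created for the run, terminated afterwards.
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",  # assumption: any supported runtime
        "node_type_id": "Standard_DS3_v2",   # assumption: an Azure VM size
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},  # hypothetical path
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())  # returns the new job_id on success
```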
-
Use shortcuts: There are lots of keyboard shortcuts available in notebooks, and the full list can be accessed from within the notebook. You won’t remember them all, but here is a list of my favorite ones, which make things super easy:
-
Shift + Enter: Use this to run the command and move the cursor to the next cell. Best of all, it inserts a new cell if you are at the end of the notebook.
-
Ctrl + /: This is by far the most-used shortcut. It comments/uncomments the code in the cell. Best of all, depending on the magic command in use, it applies the right comment syntax for the language (‘//’ for Scala, ‘--’ for SQL, or ‘#’ for Python).
-
Hold the Shift key: Hold Shift while deleting a cell. This suppresses the annoying pop-up asking you to confirm the deletion.
-
Use magic commands: I like switching cell languages as I go through the process of data exploration. Coming from a SQL background, it just makes things easy. There is no proven performance difference between the languages; all of them are first-class citizens on Databricks. Feel free to toggle between Scala, Python, and SQL to get the most out of Databricks, as in the sketch below. Refer to this link for more.
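As a quick illustration (each snippet below is a separate notebook cell; the notebook’s default language is assumed to be Python, and the view name is made up), a magic command on the first line of a cell switches that one cell’s language:

```python
# Cell 1: the notebook's default language (Python here).
# Register a temp view so cells in other languages can query the same data.
df = spark.range(10).withColumnRenamed("id", "n")
df.createOrReplaceTempView("numbers")
```

```
%sql
-- Cell 2: the %sql magic runs just this cell as SQL.
SELECT n, n * 2 AS doubled FROM numbers
```

```
%scala
// Cell 3: the %scala magic runs just this cell as Scala.
display(spark.table("numbers").selectExpr("n * 3 AS tripled"))
```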
-
Use Databricks Delta: This is by far the best feature of the technology, one that is going to change the way data lakes are perceived and implemented. Delta provides a seamless capability to upsert and delete data in the lake, which was a crazy overhead earlier. Using Delta is going to change how lakes are designed.
For more info on Delta and Delta Lake, look out for my next blog.
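For a taste of what that upsert and delete look like, here is a minimal PySpark sketch, assuming a Delta table already exists at the (hypothetical) path below and that `updates_df` is a DataFrame of incoming rows with a `customer_id` column:

```python
# Minimal sketch of an upsert (MERGE) and a delete on a Delta table.
# `spark` and `updates_df` are assumed to already exist in the notebook.
from delta.tables import DeltaTable

delta_path = "/mnt/lake/customers"  # hypothetical path to an existing Delta table
target = DeltaTable.forPath(spark, delta_path)

# Upsert: update rows that match on the key, insert the ones that don't.
(target.alias("t")
    .merge(updates_df.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Delete rows directly in the lake; previously this meant rewriting files.
target.delete("is_active = false")
```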