We recently replaced an old Kubernetes cluster in Azure AKS with a new set of notebook jobs in Azure Databricks. Overall the migration was a success, but we definitely lived through some growing pains.
Databricks is a for-profit company founded by the creators of Apache Spark. They offer a distributed compute platform of the same name, which builds on top of and improves Apache Spark. Azure and AWS both offer managed Databricks services in the cloud.
Support
It took us about 6 months to complete the full migration, with the replacement running in production and the old system turned off. Along the way we fought through several issues, so we opened Azure support tickets to get help. We quickly realized there isn't much Databricks expertise at Microsoft, but they do work closely with support engineers at Databricks to provide support. For each of the 3-4 support tickets we opened for Azure Databricks, we ultimately got our answers through assistance from support or engineers at Databricks.
Ultimately we got it working, but support was painful compared to the support for first-class services in Azure. For Azure Databricks workspaces, Azure is basically just providing scalable VMs, storage and low-level cloud resources that are optimized to run Databricks.
Geo-Redundant Storage (GRS) vs Locally Redundant Storage (LRS)
When you create a Databricks workspace in Azure, an ARM template creates a managed resource group whose resources can't be edited. The template creates a GRS storage account as a safety measure: if the primary region goes down, all the data is still available in a secondary region.
The downside can be the cost in some situations. In our case we found the GRS operations and storage ended up increasing the cost of our workspace by 4x. This is an example of the costs for one partial day of a test workspace:
Notice how the GRS-related costs are significantly higher than the VM costs. But why? These costs were roughly the same between our PROD and DEV environments, which is odd because PROD has 100x more data flowing through it. So it must be some type of fixed overhead that isn't affected by the inputs to the notebook.
Using join()
Through experimentation with our current production notebook we isolated the high storage usage and cost to the join() operation. We also built a sample notebook to prove our theory that this had nothing to do with the other operations or data around the join. The notebook command we created was very simple.
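In sketch form it looked roughly like this (using Spark's built-in rate source; the column names, rates and variable names are illustrative rather than our exact production cell):

```python
# Two in-memory streams of incrementing values -- no external sources involved.
from pyspark.sql.functions import col

left = (spark.readStream.format("rate")
        .option("rowsPerSecond", 10)
        .load()
        .selectExpr("value AS left_value"))

right = (spark.readStream.format("rate")
         .option("rowsPerSecond", 10)
         .load()
         .selectExpr("value AS right_value"))

# A stream-stream inner join on the incrementing value.
joined = left.join(right, col("left_value") == col("right_value"))

# In a Databricks notebook, display() starts the streaming query and renders the results.
display(joined)
```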
It's creating two in-memory streams of incrementing values and then joining them. There are no complex triggers or streams from outside sources; it's basically just a join of in-memory data.
Here you can see that when we start the job, the storage reads jump to 35,000 per minute and the writes jump as well. We saw the same level of reads/writes on our PROD job.
Around 12:25 we commented out the join line and simply displayed the streams without a join, and the reads/writes dropped to almost nothing. When we re-enabled the join, the storage operations jumped back up.
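For that comparison run the change was along these lines (again a sketch, not the exact cell):

```python
# Join disabled -- just display the two streams on their own.
# joined = left.join(right, col("left_value") == col("right_value"))
# display(joined)

display(left)
display(right)
```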
Solution or Workaround?
After several days of research, the Azure/Databricks support team said this is by design. The join operation has to pull data back from the nodes to the driver for the actual join and then distribute the results back out to the nodes. This transfer of data is done through temp files in DBFS. Those storage operations are then replicated via GRS for redundancy, which increases costs.
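If you want to see some of those files yourself, one option (a sketch only; the path, sink and query name below are made up, and the internal temp locations Databricks uses for display() and shuffle aren't controlled by this setting) is to give the streaming join an explicit checkpoint location and list it while the job runs:

```python
# Run the joined stream into an in-memory sink with an explicit checkpoint path
# so the streaming state files are easy to find (path and names are examples).
query = (joined.writeStream
         .format("memory")
         .queryName("join_cost_test")
         .option("checkpointLocation", "dbfs:/tmp/join_cost_test_checkpoint")
         .start())

# After the query has processed a batch or two, the join's state files show up
# under the checkpoint path, which lives on DBFS in the workspace's managed
# (GRS) storage account.
display(dbutils.fs.ls("dbfs:/tmp/join_cost_test_checkpoint"))
```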
There is an option to create the Databricks workspace manually via a customized ARM template. There are some examples of how to do this, but I would be worried that you may not get the same level of support if you build it custom.
We decided instead to rework our notebook so it doesn't use the join operation. We only use it in one place and think it should be a small amount of effort to rework. We believe removing the join() call will reduce our costs by more than 75%, which makes it well worth it.
Disappointed
I was disappointed that support's answer was that this is by design. It's basically a landmine in Databricks that will explode your costs. Most users will probably never understand why their costs are so high when they hit this issue. The support rep did say Databricks may look for ways to make the join more efficient, but I'm not holding my breath. For now my suggestion is to not use join().
I hope this post was helpful. If it was, or if you have questions or ideas, please leave a comment below to continue the conversation.
Cheers!