Cost Reduction Strategies for Databricks

This article was produced with OpenAI O1 – DeepResearch and 4 iterations over the content and cases. The prompts used can be found at the end of the article.

1. Compute Optimization

Serverless vs. Cluster Pools vs. Dedicated Clusters

Serverless compute in Databricks automatically provisions and scales resources on demand, incurring charges only when work is being done (Serverless vs pool : r/databricks). This model is highly cost-effective for intermittent or bursty workloads since you pay nothing when clusters are idle (Serverless vs pool : r/databricks). It also simplifies management (no manual sizing) and can scale out quickly. However, serverless DBUs (Databricks Units) may be slightly more expensive per runtime hour, so if your workload runs nearly 24/7, a long-running cluster could be cheaper (Serverless vs pool : r/databricks). In constant-use scenarios, you might benefit from reserved VM pricing on a dedicated cluster, whereas serverless shines when usage is sporadic or unpredictable (Serverless vs pool : r/databricks). Serverless SQL warehouses are especially useful for bursty BI queries – they start nearly instantly and auto-suspend when idle to save cost (Databricks Cost Optimization: A Practical Guide | overcast blog). For example, Databricks notes that an interactive SQL Warehouse is “the most cost-efficient engine” for SQL workloads and comes tuned with Photon for better price/performance (Best practices for cost optimization – Azure Databricks | Microsoft Learn).

Cluster Pools keep a set of VMs pre-started as a “warm pool” to reduce cluster spin-up and auto-scale times (Best practices for cost optimization – Azure Databricks | Microsoft Learn). The advantage is primarily reduced latency – jobs or notebooks can attach to a cluster much faster using pool instances. This can indirectly save cost by cutting unproductive wait time (clusters begin work sooner) and by allowing frequent job clusters to start/stop without long delays. Importantly, Databricks does not charge DBUs for idle instances sitting in the pool (Best practices for cost optimization – Azure Databricks | Microsoft Learn). You only pay the cloud VM cost for those idle nodes, not the Databricks markup, which is a savings on the Databricks side (Best practices for cost optimization – Azure Databricks | Microsoft Learn). However, since the VMs are running, you are still incurring infrastructure costs while they wait. In essence, pools let you simulate some benefits of serverless using your own managed fleet (Serverless vs pool : r/databricks) – you pay a bit extra to keep instances ready, in exchange for instant scalability. This is ideal if you have many short jobs throughout the day. If pool nodes remain idle too long, costs can add up, so you should right-size the pool (or set it to 0 when not in use) to balance responsiveness vs. cost.
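To make the pool discussion concrete, here is a minimal sketch of an instance-pool definition, written as a Python dict that mirrors the payload the Databricks Instance Pools API expects. The pool name, VM size, and thresholds are illustrative placeholders, and the exact field set should be verified against the current API reference:

```python
# Illustrative instance-pool payload (field names follow the Databricks
# Instance Pools REST API; values are examples, not recommendations).
# A small number of warm VMs stays ready for fast cluster start-up, while
# surplus idle instances are released after 15 minutes so they stop costing money.
pool_spec = {
    "instance_pool_name": "shared-warm-pool",        # hypothetical name
    "node_type_id": "Standard_F8s_v2",               # example Azure VM size
    "min_idle_instances": 2,                          # warm VMs kept ready
    "max_capacity": 10,                               # hard cap on pool size
    "idle_instance_autotermination_minutes": 15,      # release surplus idle VMs
    "azure_attributes": {
        "availability": "SPOT_AZURE",                 # pool backed by spot VMs
        "spot_bid_max_price": -1,                     # never pay above on-demand rate
    },
}
```

With a small `min_idle_instances` and a short idle-autotermination window, you keep start-up latency low without paying for a large warm fleet around the clock.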

Dedicated Clusters per User (i.e. each data scientist or engineer running their own all-purpose cluster) are the simplest to isolate workloads, but also the least cost-efficient in many cases. Each such cluster has its own driver and potentially idle executors, leading to low overall utilization if users aren’t running heavy workloads continuously. Moreover, interactive all-purpose clusters carry a higher DBU rate than job-specific clusters (Best practices for cost optimization – Azure Databricks | Microsoft Learn) – running notebooks on a personal cluster can cost significantly more than using a job cluster for the same work (Best practices for cost optimization – Azure Databricks | Microsoft Learn). If multiple users each have a small cluster running, consider that you are paying for N separate sets of infrastructure. Unless each user consistently maxes out their cluster, this approach wastes resources. For example, 5 users with 5 small always-on clusters will generally cost more than sharing a single larger cluster among 5 users with comparable total resources. Recommendation: Favor serverless or job-scoped compute for ad-hoc workloads and use pools for repeated jobs, rather than giving every user a long-lived personal cluster. Only keep dedicated clusters for special cases (e.g. if a particular workload needs unique libraries or security isolation). As one Databricks expert succinctly put it: with pools you’re paying for capacity even when idle, whereas with serverless you pay nothing when idle (Serverless vs pool : r/databricks). The key trade-off is control vs. convenience: dedicated clusters give each user full control but at a high cost for idle time, while serverless/pools improve overall efficiency at the expense of some control over the underlying instances.

Shared Clusters for Multiple Users

Instead of siloing each engineer or scientist on their own cluster, it’s often feasible to share clusters to increase utilization. Azure Databricks allows multiple users to attach to the same cluster and perform interactive analysis collaboratively (Cost Breakdown of Databricks by job or user – Microsoft Q&A). By using a shared all-purpose cluster for a team, you reduce the total number of running clusters (for example, one cluster serving 5 users rather than 5 separate clusters). This means fewer idle cores sitting around and less duplicate overhead (you avoid paying for 5 drivers when one can serve everyone). In practice, a shared medium-size cluster can handle the notebook workloads of several users, scaling up if needed to accommodate concurrent tasks. This approach is cost-effective because Spark can schedule work for all users on the same set of nodes, improving overall resource utilization.

The trade-offs with shared clusters are manageability and performance isolation. Users have to agree on a base environment (installed libraries, configurations) on that cluster. If one user’s job is very demanding, it could consume most of the cluster, potentially slowing others’ work. These issues can be mitigated by enabling fair scheduling in Spark and setting cluster autoscaling with a sufficiently high max to handle peak load. In our experience, having one cluster for a small group (e.g. all data scientists) works well during business hours – the cluster can scale out during a burst of activity and scale back down when users are idle. It should also be configured to auto-terminate after hours to avoid overnight idling (Best practices for cost optimization – Azure Databricks | Microsoft Learn). Shared clusters dramatically cut down the number of underutilized instances, at the cost of some complexity in coordination. For many teams, this is a reasonable trade-off, yielding a direct reduction in compute-hours billed. (Multiple users on one cluster collectively incur the same DBU rate as that one cluster, which is often far less than the sum of each user running their own cluster.)

Autoscaling and Spot Instances for Jobs

For batch or stream workloads like micro-batching with UDFs, leveraging autoscaling and spot instances can yield substantial savings:

  • Autoscaling: Databricks clusters can be set to scale their number of worker nodes up or down based on workload demand (Best practices for cost optimization – Azure Databricks | Microsoft Learn). For example, you might configure a job cluster with 1–8 workers; if the micro-batch workload spikes (e.g. a large input batch arrives or a UDF becomes more CPU-intensive), the cluster will scale out up to 8 nodes. When processing slows or completes, it can scale back to the minimum (even 0 for job clusters, or 1 for streaming) to avoid paying for idle capacity. This ensures you’re only paying for the compute power you actually need, when you need it (Databricks Cost Optimization: A Practical Guide | overcast blog). Particularly for variable or bursty workloads, autoscaling prevents over-provisioning. Without it, you might size a cluster for peak load and pay for that size even during lulls. With autoscaling, the cluster automatically adjusts, which smooths out utilization and cuts costs during low-demand periods. It’s recommended to also enable auto-termination (e.g. terminate after 30 minutes of idleness) so that even the minimal cluster shuts down when not in use (Best practices for cost optimization – Azure Databricks | Microsoft Learn). The trade-off is a slight lag when scaling up (it takes a short time to acquire new VMs), so very spiky workloads could experience minor delays. Overall, autoscaling is a best practice for cost – it “ensures you only pay for the compute power you need” (Databricks Cost Optimization: A Practical Guide | overcast blog), at the expense of some real-time control over cluster size.
  • Spot Instances: Azure Spot VMs are spare compute capacity offered at deep discounts (up to 90% cheaper than on-demand prices) (Azure Spot Instances for Databricks AI | Databricks Blog). Databricks allows using spot instances for worker nodes in a cluster. This can massively reduce the cost of running batch jobs. For example, if your micro-batching job normally needs 5 standard VMs, you could run with 5 spot VMs and potentially pay only a fraction of the price. According to Databricks, using Spot VMs for Azure Databricks clusters can save “up to 90% on compute costs” (Azure Spot Instances for Databricks AI | Databricks Blog). The caveat is that spot instances can be evicted by Azure with short notice if capacity is needed elsewhere. Databricks mitigates this by automatically replacing evicted spot nodes with on-demand nodes to ensure the job completes (Azure Spot Instances for Databricks AI | Databricks Blog). Still, jobs must tolerate the loss of a worker (Spark will rerun the lost tasks). Micro-batch processing with idempotent UDFs is typically a good candidate for spot, because if a batch fails or a node drops, the stream can retry that batch. If the workload is extremely latency-sensitive (e.g. real-time streaming with tight SLAs of seconds), spot interruptions could be problematic – one guide notes to avoid spot for real-time or transactional workloads where any disruption is costly (Databricks Cost Optimization: A Practical Guide | overcast blog). But for most ETL, batch, or streaming jobs with some tolerance, spot is a great way to cut costs. You can also use a mixed strategy: for instance, run 80% of workers as spot and 20% as on-demand to balance cost and reliability (Databricks Cost Optimization: A Practical Guide | overcast blog). In practice, enabling a spot strategy might mean occasional slightly longer job runtimes (when an eviction happens), but huge cost savings over time.
  • Combining Autoscaling + Spot: These features work very well together. You might configure a cluster to autoscale between 1 and 10 nodes, all of which are spot instances. During low load, maybe 1 spot node runs; during a peak, it scales to 10 spot nodes. You pay spot prices for whatever scale you use, minimizing cost at every step. If the cluster needs to scale up quickly, using an instance pool in conjunction can help – the pool can maintain a few idle spot VMs ready to attach, so autoscaling doesn’t have to wait for Azure to spin up new VMs (Azure Spot Instances for Databricks AI | Databricks Blog). This hybrid approach ensures capacity comes online fast and cheaply. The overall impact is that even computationally heavy UDF workloads can run efficiently: when they need more power they get it (thus maintaining performance), but you’re paying the lowest possible rate for that power, and not paying when it’s not needed.

In summary, take advantage of autoscaling to handle the variable throughput of micro-batching jobs and use spot instances to drastically cut the per-instance cost for those jobs. These strategies can be implemented via cluster policies (enforcing autoscaling on clusters, disallowing fixed oversized clusters, etc.) (Best practices for cost optimization – Azure Databricks | Microsoft Learn). The potential savings are significant – for example, using spot VMs in Databricks has been shown to save up to 34%+ just by reducing runtime costs in one case study (Databricks Cost Optimization: A Practical Guide | overcast blog) (and potentially much more, depending on spot price and availability). The trade-off is mostly around reliability: as long as your jobs can handle the dynamic nature (which most batch jobs can, with Spark’s retry mechanisms), the cost benefits far outweigh the occasional inconvenience.
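As a concrete illustration of these settings, below is a hedged sketch of a job-cluster definition (the `new_cluster` block of a Databricks Jobs API payload, written as a Python dict) that combines autoscaling with Azure spot workers and automatic on-demand fallback. The runtime version, VM size, and worker counts are placeholders to adapt to your workload:

```python
# Illustrative job-cluster spec: autoscaling workers on Azure spot capacity,
# with the driver kept on-demand and automatic fallback if spot VMs are evicted.
# Field names follow the Databricks Clusters/Jobs API; values are examples.
new_cluster = {
    "spark_version": "14.3.x-scala2.12",           # example LTS runtime
    "node_type_id": "Standard_F8s_v2",             # compute-optimized example
    "autoscale": {"min_workers": 1, "max_workers": 8},
    "azure_attributes": {
        "first_on_demand": 1,                       # keep the driver on-demand
        "availability": "SPOT_WITH_FALLBACK_AZURE", # spot workers, fall back if evicted
        "spot_bid_max_price": -1,                   # cap spot price at the on-demand rate
    },
}
```

For an all-purpose (interactive) cluster you would add an `autotermination_minutes` setting as well; job clusters are released automatically when the run finishes, so no idle time accrues after the job.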

2. Local Execution & Deployment Optimization

Databricks Asset Bundles for Local Development

Databricks Asset Bundles (DABs) are a new tool to facilitate developing and deploying workloads with software-engineering rigor. A Databricks Asset Bundle packages your code (notebooks, Python modules, etc.) along with configuration for jobs, clusters, libraries and other resources, all in a structured format (Develop a job on Azure Databricks using Databricks Asset Bundles – Azure Databricks | Microsoft Learn). In practice, a bundle is defined by YAML/JSON config files and your project code, which together describe everything needed to run a job on Databricks. Using DABs can significantly improve cost efficiency during development in several ways:

  • Local Coding and Testing: With DABs, you can develop code in your favorite IDE on your local machine, including writing unit tests for your functions. The bundle concept encourages treating notebooks and jobs as code artifacts that can be version-controlled and tested offline. Data practitioners can thus iterate locally (which incurs no cloud cost) and only execute on a Databricks cluster when necessary. This contrasts with the traditional approach of developing directly in a notebook on a cluster (incurring cloud costs for every experiment). In other words, DABs enable an offline-first workflow where you “adopt best practices in software development” and only use the cloud for integration testing and production runs (Simplify your workflow deployment with Databricks Asset Bundles: Part I – Xebia).
  • Efficient Deployment and CI/CD: Asset Bundles integrate smoothly with CI/CD pipelines (Simplify your workflow deployment with Databricks Asset Bundles: Part I – Xebia). You can validate a bundle configuration and even run it in a staging environment via the command-line or CI job. For example, a developer might deploy their bundle to a small development Databricks workspace to test it, then promote the same bundle to production via an automated pipeline (Simplify your workflow deployment with Databricks Asset Bundles: Part I – Xebia). This automation means you can spin up ephemeral job clusters for testing and tear them down automatically. It avoids long-lived dev clusters. The deployments are scriptable and repeatable, so there’s less guesswork and less need to keep a cluster running “just in case” someone needs to rerun a notebook. According to Microsoft’s tutorial, data engineers can create a bundle for a job, validate it locally, then deploy and run it in the cloud workspace programmatically (Develop a job on Azure Databricks using Databricks Asset Bundles – Azure Databricks | Microsoft Learn) (Develop a job on Azure Databricks using Databricks Asset Bundles – Azure Databricks | Microsoft Learn). By doing so, you only incur cloud costs during the actual job run – all the packaging and validation can be done locally. This can lead to significant savings over time, especially if your team previously did a lot of interactive tweaking on clusters.
  • Bundle of Libraries and Config: Another benefit is that Asset Bundles encapsulate library dependencies and environment setup. This reduces the need to manually configure interactive clusters for development. For example, if your job needs certain Python packages, those can be defined in the bundle. Developers can test that environment locally (using tools like venv or Conda) and be confident that when the bundle deploys to Databricks, it will bring those dependencies. This avoids the common pattern of “keep a cluster running with my libraries installed so I can play around” – a pattern that costs money. Instead, you deploy the environment on-demand with the job. In short, DABs promote an “infrastructure as code” mindset for Databricks, where you spin up resources only when needed and consistently through automation.

Overall, using Databricks Asset Bundles can streamline the move from development to production and minimize the time spent using expensive interactive clusters. Data engineers can do most of their work locally and treat the Databricks runtime as a deployment target, not a development sandbox. This approach might require an initial investment in setting up the bundle configurations and CI/CD, but it pays off by eliminating manual, long-running dev clusters. The process can be summarized as: develop locally, deploy to a dev workspace for testing, then promote to prod – all driven by the bundle (Simplify your workflow deployment with Databricks Asset Bundles: Part I – Xebia). This keeps cloud usage lean and purposeful.

Using Local or Hybrid Compute for Development

Another strategy to cut costs is to perform as much development and testing as possible off-cluster (locally or on cheaper resources) before using Databricks compute. Some techniques and tools that enable this:

  • Local Spark Testing: Apache Spark can run in a local mode on a developer’s machine (using a single node or multi-threading for parallelism). Engineers can leverage this to test Spark code on small samples of data without launching a cluster. For instance, PySpark with a local master (local[*]) can execute the same code that would run on a cluster, just at a smaller scale. This approach has clear cost benefits – you utilize your local CPU/RAM (which is free) for initial development. As one blog notes, a local Spark setup provides “various benefits such as cost savings and isolated working” for developers (Local Development using Databricks Clusters – Pivotal BI). It won’t handle large data, but it’s usually sufficient for logic testing and unit tests. By catching bugs and iterating on code locally, you avoid wasting cluster time. Many teams combine this with small test data files on their laptop or a development VM.
  • Databricks Connect: Databricks Connect is a client library that allows you to write Spark code on your local machine and execute it against a remote Databricks cluster. It essentially connects your local Spark session to the Databricks cluster’s Spark backend (Local Development using Databricks Clusters – Pivotal BI). This can be considered a hybrid approach: you code and do some processing locally, but heavy operations happen on the cluster. The benefit is you can develop in an IDE with a local environment, and only when you trigger an action does it use the cluster’s resources. This might not directly reduce cluster usage compared to working in a notebook (since the cluster still runs the job), but it allows sharing one cluster among multiple developers or tests. For example, an engineer can connect to a large shared dev cluster, run a few tests, then disconnect – all without the overhead of managing notebooks on the workspace. One must ensure the local and cluster Spark versions are compatible (Local Development using Databricks Clusters – Pivotal BI), which is a noted caveat, but it can still improve productivity. The cost angle here is that you might maintain a single always-on small cluster for dev/testing that everyone connects to, rather than each developer running their own cluster. It’s a form of consolidation enabled by remote connect. However, with the advent of Asset Bundles and an improved notebook experience, Databricks Connect is less emphasized now than before (the legacy client only supported runtimes prior to DBR 13; a rewritten client based on Spark Connect is available for DBR 13+). Still, it’s an option in certain scenarios.
  • Single-Node (Ephemeral) Clusters: If you do need to use Databricks for development (e.g., to access large data or a specific environment), consider using a small single-node cluster or an otherwise minimal cluster (a configuration sketch follows this list). Databricks allows clusters with zero workers (just a driver node that does all the work). This is effectively like running Spark on one node in the cloud. For many development tasks, a single node (with 8–16 cores and some RAM) is sufficient and much cheaper than a multi-node cluster. It also avoids the overhead of distributed shuffles for small workloads. In fact, Databricks recommends using a single-node cluster for initial experiments in some cases, because “having fewer nodes” can simplify things (and reduce overhead) (Compute configuration recommendations – Azure Databricks | Microsoft Learn). For example, a data scientist exploring a new dataset could attach their notebook to a single node cluster that auto-terminates in 30 minutes. The cost might be, say, 1 node * 30 min of usage, rather than 4 nodes * several hours if they had a larger cluster sitting idle. Once the code is working and needs to be run on the full dataset, you can switch to a bigger cluster or a job run. Automating the startup of these small clusters (possibly via APIs or CLI) on demand can integrate this into the workflow so that you’re not keeping even the single node up longer than needed.
  • Leverage Free or Low-Cost Tiers: If suitable, you could use the Databricks Community Edition for some development. Community Edition is a free Databricks workspace with a small 15 GB cluster available (Databricks Community Edition FAQ). It has limitations (Spark 3.x, limited compute power, no real data integration with your Azure storage), so it’s not an option for most professional workflows. But for pure Spark API experimentation or learning, it’s a zero-cost environment. In most team settings, a better approach is to have an Azure dev/test subscription with a smaller Databricks workspace that is only active during work hours.
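As referenced in the single-node bullet above, here is a rough sketch of what a minimal single-node development cluster looks like as a Clusters API payload (Python dict form). The runtime, VM size, and 30-minute auto-termination are example values to adjust:

```python
# Sketch of a single-node dev cluster: zero workers, the driver does all the work,
# and the cluster shuts itself down after 30 idle minutes. Values are examples.
single_node_cluster = {
    "cluster_name": "dev-single-node",            # hypothetical name
    "spark_version": "14.3.x-scala2.12",          # example LTS runtime
    "node_type_id": "Standard_DS3_v2",            # small, inexpensive VM
    "num_workers": 0,                              # driver-only cluster
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
    "autotermination_minutes": 30,                 # don't pay for forgotten clusters
}
```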

In summary, prioritize local execution whenever possible: debug your UDF logic on a local Spark session, run unit tests on your functions outside of Spark, and only use a cluster for scaling up or integration tests. When you do need a cluster, use the smallest resource that gets the job done (and ensure it auto-terminates when idle). By reducing the time spent on large interactive clusters for development, teams have seen significant cost reductions. One team reported moving much of their dev work off Databricks notebooks, using a local IDE with dbx (a Databricks Labs deployment tool that preceded Asset Bundles), and only deploying to an ephemeral job cluster when ready – this eliminated the need for long-running dev clusters entirely, cutting their interactive compute costs to near-zero (aside from actual job runs). The trade-off here is that setting up local environments and CI pipelines requires effort, and developers need the discipline to not default to “just use the big cluster because it’s there.” But with modern tools, the developer experience can remain smooth while following this cost-efficient model.
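As an example of this “test locally, deploy later” workflow, the sketch below unit-tests UDF logic on a local[*] SparkSession with pytest. The `parse_amount` function is a hypothetical stand-in for your own UDF code; nothing here touches a Databricks cluster, so the iteration loop costs nothing in DBUs:

```python
# Minimal local unit test for UDF logic. Runs entirely on a local[*] SparkSession;
# `parse_amount` is a toy stand-in for the real UDF under test.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def parse_amount(raw: str) -> float:
    """Toy example: parse European-formatted decimal strings."""
    return float(raw.replace(",", "."))

@pytest.fixture(scope="session")
def spark():
    # Local mode uses the laptop's cores; no cloud resources involved.
    return (SparkSession.builder
            .master("local[*]")
            .appName("unit-tests")
            .getOrCreate())

def test_parse_amount_udf(spark):
    parse_udf = udf(parse_amount, DoubleType())
    df = spark.createDataFrame([("1,5",), ("2,0",)], ["raw"])
    result = [r.amount for r in df.select(parse_udf("raw").alias("amount")).collect()]
    assert result == [1.5, 2.0]
```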

3. Cost-Effective Instance and Storage Optimization

Best Instance Types for Micro-Batching Workloads

Choosing the right VM instance type for your Azure Databricks clusters is crucial for cost optimization. Different workloads benefit from different hardware profiles, and picking a cost-effective instance can save money while meeting performance needs:

  • General Purpose vs. Compute Optimized: Micro-batching workloads (small batches of data processed frequently, often with CPU-bound UDFs) typically benefit from higher CPU throughput rather than extremely large memory. For such jobs, compute-optimized instances (with more vCPUs per unit of memory) can give better price/performance. On Azure, families like Fsv2 series offer high CPU clock speeds and a high core-to-memory ratio at lower cost – these can be ideal if your UDFs are heavy on computations but don’t require loading huge datasets in RAM. General purpose VMs (e.g. Dv3/Dv4 series) are a balanced choice if you need a mix of decent CPU and memory. They tend to be cost-effective for a wide range of workloads and often are considered “cost-efficient” default options (Best practices for cost optimization – Azure Databricks | Microsoft Learn). In contrast, memory-optimized instances (e.g. Ev4, which have large RAM per core) might be overkill for micro-batches unless you’re doing something like windowing or aggregation that needs big in-memory state. Since memory-optimized nodes are pricier, you should use them only if you’ve identified memory as the bottleneck. Always start with the smallest instance size that can handle your workload’s memory footprint and scale out from there rather than using a few very large nodes that sit half-empty.
  • Avoid Unneeded Premium Hardware: Steer clear of expensive specialized instances unless your workload truly requires them. For example, GPU-enabled instances are significantly more costly and are only beneficial for specific tasks (like deep learning training). The vast majority of Spark jobs (ETL, SQL, UDFs) do not use GPUs at all, so using a GPU VM would waste money (Best practices for cost optimization – Azure Databricks | Microsoft Learn). Azure Databricks documentation explicitly advises restricting GPU machines unless you know you need them (Best practices for cost optimization – Azure Databricks | Microsoft Learn). The same logic applies to other specialized VMs: e.g. very high memory VMs (like the M series with hundreds of GB of RAM) or very IO-heavy VMs (Ls series with NVMe disks) should be chosen only if your jobs bottleneck on those resources. If your micro-batch job is mostly CPU-bound, an F-series 8-core VM at ~$X/hour is more economical than an E-series 8-core at ~$X*1.5/hour that includes extra memory you won’t fully utilize. Cluster policies can be used to enforce this, by allowing only certain instance types for user clusters (ensuring nobody accidentally selects a very costly VM type) (Best practices for cost optimization – Azure Databricks | Microsoft Learn).
  • Scale Up vs. Scale Out: There’s often a question of using a few large nodes vs. many small nodes for the same total computing power. For example, 8 workers with 4 cores each vs. 2 workers with 16 cores each (both give 32 cores total). Pure cost might be similar if the core/hour price is linear, but performance can differ due to overheads (more nodes mean more network shuffle overhead, but also more parallelism for I/O). For micro-batching, which usually processes data in small chunks quickly, having too many nodes might lead to diminishing returns – the batch might be so small that the overhead of coordinating many executors outweighs the benefit. In such cases, using moderately larger nodes (e.g. 4–8 cores each) can reduce coordination cost. Databricks notes that two larger workers vs. several smaller ones can have the same raw resources (Compute configuration recommendations – Azure Databricks | Microsoft Learn), so it’s about finding the sweet spot. A good practice is to experiment with cluster sizing: run your workload on different cluster shapes (1 large node, few medium nodes, many small nodes) and observe the runtime and cost. Often, a smaller cluster that just fits the batch will be most cost-efficient, because the job finishes quickly and the cluster can shut down sooner. Also consider that Databricks charges DBUs per node size/instance type – e.g. a very high-end VM might cost more DBUs per hour. Cost-efficient families in Azure (like Dv3, Fsv2) generally have a lower DBU cost rating than extremely large or specialized VMs.
  • Leverage Newer Generations and Spot: New VM generations (Dv5, Ev5, etc.) often provide better performance at similar or lower price than older ones. If available in Databricks, prefer the latest generation instances as they can do more work per dollar (for example, the Ebdsv5 series can have 300% better disk performance than prior gen for the same cost (Virtual Machine series | Microsoft Azure), which means faster I/O for the same price). Also, remember you can combine instance selection with spot pricing as discussed. For instance, running Fsv2 spot instances could be the cheapest path if your workload tolerates it. One user shared that moving from expensive instance types to a mix of more cost-effective instances plus spot saved their team a lot – “stopped using [premium] i3 machines, and instead used … spot instances” and enforced smaller clusters, combined with aggressive auto-termination (How did you reduce your Databricks costs ? : r/dataengineering) (How did you reduce your Databricks costs ? : r/dataengineering). While that example was AWS-specific, the principle holds: avoid pricy instance families and use discounted pricing options whenever you can.

In summary, optimize your cluster configuration by using instance types that match your workload profile without over-provisioning resources you don’t use. Use cluster policies or guidance to ensure only “cost-efficient VM instances” are selected (Best practices for cost optimization – Azure Databricks | Microsoft Learn). A practical approach is: start with a standard DSv2/DSv3 or FSv2 instance for most jobs, monitor the job’s CPU, memory, and I/O usage, and adjust if needed (scale up instance size if hitting CPU 100% or memory limits, or scale out if CPU is low but job is slow indicating not enough parallelism). By doing this, you can often cut the compute cost for a pipeline by 20-30% just by eliminating waste (for example, we found a job running on huge machines with 10% CPU usage – moving it to smaller machines and more parallelism cut cost dramatically with the same performance).
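One way to enforce this organizationally is a cluster policy. The sketch below shows a policy definition (the JSON format Databricks cluster policies use, written here as a Python dict) that limits users to a couple of cost-efficient VM families, caps autoscaling, and forces auto-termination. The allowed types and limits are examples; verify the attribute paths against the current policy reference:

```python
# Illustrative cluster-policy definition: restrict node types, cap cluster size,
# and force auto-termination. Attribute paths and constraint types follow the
# Databricks cluster policy definition format (values are examples).
policy_definition = {
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_F8s_v2", "Standard_D8s_v3"],   # cost-efficient examples
        "defaultValue": "Standard_F8s_v2",
    },
    "autoscale.max_workers": {
        "type": "range",
        "maxValue": 8,                      # nobody silently spins up 50 nodes
    },
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 60,                     # idle clusters must shut down within the hour
        "defaultValue": 30,
    },
}
```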

Optimizing Storage Costs (Delta Lake and Azure Blob)

Storage may not be the biggest line item compared to compute, but inefficient storage directly leads to higher compute costs (scanning more data, waiting on I/O) and can inflate your cloud bill if not managed. Two key aspects here are the data format/management (Delta Lake) and the cloud storage tier (Azure Blob Storage):

  • Use Delta Lake for efficient storage and IO: Storing your data in Delta Lake format (which is an optimized layer on Parquet) is strongly recommended for Databricks workloads. Delta Lake brings performance enhancements like data skipping, file compaction, and schema evolution, which make reads and writes much more efficient compared to raw formats (Best practices for cost optimization – Azure Databricks | Microsoft Learn). Microsoft highlights that Delta Lake “significantly [speeds] up workloads compared to using Parquet, ORC, and JSON”, directly translating to shorter compute uptime and lower costs for your jobs (Best practices for cost optimization – Azure Databricks | Microsoft Learn). In practice, Delta’s benefits mean your ETL or query jobs finish faster (because less data is read and written), so the cluster can be terminated sooner to save money. Some specific ways Delta helps:
    • Small File Compaction: Many big data pipelines produce lots of small files (e.g. each micro-batch writes a file). Reading thousands of tiny files is inefficient and increases the overhead per query. Delta Lake can automatically combine these into larger files, or you can periodically run the OPTIMIZE command to merge small files into chunks of, say, 256 MB. This improves Spark’s throughput and lowers CPU overhead. Managing file sizes in Delta is “key to boosting performance and cutting costs” – too many small files lead to excessive metadata handling and slow queries (Databricks Cost Optimization: A Practical Guide | overcast blog). By optimizing file size, you ensure each task does meaningful work, thus reducing the total runtime. It’s common to see dramatic speedups (and thus cost drops) after compaction of a table that had a small-files problem.
    • Data Skipping and Indexing: Delta Lake maintains transaction logs and statistics about data (like min/max values per column per file). This allows Databricks to skip reading files that don’t match a query’s filter, for example. Less data read = less IO and compute. Techniques like Z-Ordering (clustering data on disk by a key) further enhance this. By clustering files on common query keys, Delta can skip even more data. The result is faster queries, meaning you can use a smaller cluster or finish sooner, saving DBUs. It’s noted that using Delta’s optimizations (like z-order, data skipping) “reduce the cost of queries by minimizing the amount of data read” (Databricks Cost Optimization: A Practical Guide | overcast blog).
    • Concurrency and Reliability: Delta’s ACID properties reduce the likelihood of job failures due to inconsistent data (no half-written files, etc.). This means you’re less likely to re-run jobs because of data issues, indirectly saving cost by avoiding wasted reruns. Additionally, multiple jobs can read/write Delta concurrently, which can eliminate the need for duplicate copies of data for different workflows (saving storage space).
    • Time Travel vs. Data Duplication: Delta’s time travel feature (keeping history of changes) can be a cost saver as well – it can reduce the need to keep full snapshot copies for backups or audits. Instead of storing N versions of a dataset in separate directories (and paying for N× storage), you keep one Delta table with version history. Accessing old data is still possible without the overhead of maintaining multiple datasets (Databricks Cost Optimization: A Practical Guide | overcast blog). Just remember to vacuum old history you truly don’t need, to free storage.
    Bottom line: Invest in Delta Lake optimization. The initial cost to run OPTIMIZE or ZORDER commands periodically is easily offset by the subsequent saving in query costs. For example, one might schedule a weekly compaction job (which itself costs some compute), but if all daily queries then run 2x faster, they use half the cluster time – a net cost reduction. Using Delta for all major tables also simplifies pipeline design and reduces development effort (which has its own kind of cost). A maintenance sketch follows at the end of this list.
  • Optimize Azure Blob Storage usage: Azure Databricks uses Azure Blob Storage (often via ADLS Gen2) to store data (e.g. the DBFS root and mount points). To reduce storage costs, consider the following:
    • Tiering Data: Azure Blob offers Hot, Cool, and Archive access tiers. The Hot tier is for frequently accessed data and has the highest storage cost (but lowest access cost), whereas Cool and Archive tiers have cheaper storage but higher read costs and restrictions (e.g. data should remain for 30+ days in cool, 180+ in archive) (Azure Cost Optimisation: 12 Quick Tips for a Lower Bill). For data that you do not access often (old logs, historical raw data, old checkpoints), move it to cool or archive storage. You can save significantly by not paying hot storage rates for cold data – “save massively by aligning storage types with your actual usage” (Azure Cost Optimisation: 12 Quick Tips for a Lower Bill). Implement lifecycle management rules to automatically transition blobs to cooler tiers after a certain age. For example, store the last 30 days of data in Hot, next 90 days in Cool, and anything older in Archive. This way, you’re only paying premium for the data you’re actively using. Many companies find that a majority of their data is “cold” and can be stored cheaply if handled properly.
    • Efficient Data Layout: Organize your data in larger files and partitioned directories. This ties into Delta optimization – larger files reduce overhead, and partitioning (by date, for instance) means when you read one day of data, you don’t touch all other days. Efficient layout means fewer total bytes scanned from storage, which not only speeds up jobs but also reduces any transaction costs or read operations on the storage account. While Azure’s storage cost model is mostly capacity-based, very frequent transactions (calls) can incur costs; using partition pruning (only listing relevant folders) mitigates that.
    • Compression: Ensure that data is stored in compressed form (Parquet and Delta are compressed by default using codecs like Snappy). Compressed data consumes less space on Blob and also means less data has to be read into the cluster memory. Just be mindful of not using an overly slow compression that burns CPU; Snappy is a good balance usually. Smaller storage footprint directly = lower storage cost.
    • Avoid Redundant Copies: It’s easy to let data sprawl happen (e.g., multiple people making copies of the same table for their experiments). Encourage use of Delta’s time travel or shallow clones instead of full duplicates if possible. Or use version control in Lake (e.g., partitions for different versions of data) rather than copying data to new paths. Every extra copy doubles storage costs and can double the compute needed to update or fix data. By cleaning up unused data and consolidating duplicates, you can often trim down your storage costs.
    • Geo-Redundancy Choices: Azure storage accounts can be LRS (locally redundant), ZRS/GRS (zone or geo redundant). Higher redundancy increases cost. For non-production or recreatable data (like interim analysis outputs), you could choose LRS to cut cost. Use GRS only for critical data that needs cross-region disaster recovery. This isn’t a Databricks setting per se, but part of Azure storage setup that can affect cost significantly.
  • Caching and Data Access Patterns: While not a direct “cost” in terms of billing, using Databricks caching (Delta cache or Spark’s CACHE command) can reduce the amount of data read from Blob storage for iterative algorithms or repeated queries. This translates to faster completion and less time that clusters need to run. For instance, if your data scientists repeatedly query the same large dataset in a notebook, consider caching it in memory on the cluster (if it fits) or using Delta caching on the local SSDs of the workers. Then they won’t hit the storage for every query. This kind of optimization straddles the line between performance tuning and cost saving – essentially, speeding up workflows is a form of cost optimization because you’re paying for fewer compute seconds (Databricks Cost Optimization: A Practical Guide | overcast blog) (Databricks Cost Optimization: A Practical Guide | overcast blog).
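The Delta maintenance referenced above boils down to a couple of periodic commands. A minimal sketch, assuming a placeholder table `events` that is commonly filtered by `event_date`:

```python
# Periodic Delta maintenance: compact small files and co-locate rows by a common
# filter column, then clean up files no longer referenced by the table history.
# Table and column names are placeholders; adjust retention to your audit needs.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
spark.sql("VACUUM events RETAIN 168 HOURS")   # keep the default 7-day history window
```

Scheduling this as a small weekly job keeps read costs low for every downstream query against the table.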

To illustrate the impact: imagine a 1 TB dataset stored in 10,000 tiny JSON files vs. the same data in a well-partitioned Delta table with 200 large files, with older partitions in cool storage. The first scenario might take an hour of 10-node compute to read and process (and you pay high storage for 1 TB hot). The second scenario might take 10 minutes of 5-node compute to process (because Delta skipped a bunch of irrelevant data and read larger files efficiently), and you pay maybe 1/3 the storage cost for the cold portions. The difference in monthly cost for that pipeline could be enormous. Thus, investing in proper data storage management on Azure can yield direct savings in both storage bills and compute bills.
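To automate the Hot/Cool/Archive schedule described above, you can attach a lifecycle-management policy to the storage account. Below is a sketch of the rule structure, written as a Python dict that mirrors the JSON Azure expects; the path prefix and day thresholds are placeholders:

```python
# Sketch of an Azure Storage lifecycle-management rule: move blobs to Cool after
# 30 days and to Archive after 180 days. The container prefix is hypothetical.
lifecycle_policy = {
    "rules": [
        {
            "name": "tier-old-raw-data",
            "enabled": True,
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["datalake/raw/"],   # placeholder container/path
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool":    {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                    }
                },
            },
        }
    ]
}
```

A rule like this is applied once (for example via the Azure portal or `az storage account management-policy create`), after which tiering happens automatically as data ages.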

4. Best Practices & Recommendations

Bringing together the strategies above, here are the key recommendations for a team of data engineers (managing workflows/jobs) and data scientists (running notebooks), along with potential savings and trade-offs:

  • Right-size and right-type your clusters – Avoid keeping large interactive clusters running when a smaller or ephemeral job cluster would do. Use job clusters (Jobs Compute) for production pipelines whenever possible, since they charge a lower DBU rate than all-purpose notebooks (Best practices for cost optimization – Azure Databricks | Microsoft Learn). Similarly, prefer serverless compute for on-demand workloads; it eliminates idle time costs by only running when you have work (Serverless vs pool : r/databricks). For example, a nightly ETL that used to run on an always-on 8-node cluster could be moved to a job that spins up, runs on a cluster for an hour, then terminates – saving all the compute hours it used to sit idle. Many teams also find it useful to serve ad-hoc queries through Databricks SQL warehouses rather than a general Spark cluster, as the SQL warehouse can auto-stop when not in use and is optimized for cost per query (Best practices for cost optimization – Azure Databricks | Microsoft Learn) (How did you reduce your Databricks costs ? : r/dataengineering). The trade-off with fully serverless or job-scoped compute is slightly less real-time interactivity (you might wait ~30 seconds for a cluster spin-up in some cases) and needing to manage job scheduling, but the cost savings of consolidating work and cutting idle time are typically well worth it.
  • Consolidate usage and eliminate idle resources – Whenever feasible, have users share clusters or use multitask workflows instead of isolated clusters per person/task. A shared interactive cluster with autoscaling can handle multiple users’ notebooks and will shut down when nobody is using it. This drastically reduces the number of low-utilization clusters running concurrently. One user’s testimonial was to “make aggressive use of job clusters” and move to serverless for notebooks (How did you reduce your Databricks costs ? : r/dataengineering) – in practice, they cut down from many always-on clusters to primarily job-based clusters on demand. The potential savings here can be on the order of 30-50% if previously everyone had their own cluster running 8+ hours a day with only sporadic use. The trade-off is that teams need to coordinate (e.g. not everyone running heavy jobs at the exact same time on a shared cluster unless it can scale for it), but with proper autoscaling and policies, this is manageable. Always enable auto-termination on interactive clusters (e.g. 1 hour or even 15 minutes of inactivity) so you’re not paying overnight or over the weekend for something someone forgot to turn off (How did you reduce your Databricks costs ? : r/dataengineering) (Best practices for cost optimization – Azure Databricks | Microsoft Learn).
  • Leverage cost-saving features (autoscaling, spot, pools) – Ensure autoscaling is turned on for clusters with variable workloads, so you don’t pay for max capacity during low usage (Best practices for cost optimization – Azure Databricks | Microsoft Learn). Use spot instances for non-critical workloads to save up to ~90% on VM costs (Azure Spot Instances for Databricks AI | Databricks Blog) – for example, a batch job that costs $10 on on-demand VMs might cost only $1–$2 with spots, albeit with possible retries if a node is reclaimed. Many teams have reported substantial reductions in batch processing costs by using spot: one guide notes it’s great for ETL or ML training jobs where interruptions can be handled (Databricks Cost Optimization: A Practical Guide | overcast blog) (Databricks Cost Optimization: A Practical Guide | overcast blog). The trade-off is a slight hit to reliability, but Databricks mitigates this by automatically falling back to on-demand if needed (Azure Spot Instances for Databricks AI | Databricks Blog). Using instance pools can further optimize costs by cutting down startup time (meaning your jobs spend more time doing work and less time waiting) (Best practices for cost optimization – Azure Databricks | Microsoft Learn). While pools themselves don’t reduce VM-hour costs (you still pay for VMs sitting idle to buffer the pool), they improve efficiency for bursty job workflows and can be combined with spot (spot pools) for maximum savings. A best practice is to analyze job logs to find idle or wait times and apply these features accordingly (Databricks provides metrics and system tables to help with this monitoring (Best practices for cost optimization – Azure Databricks | Microsoft Learn)).
  • Optimize code and development practices – Sometimes cost savings come from improving the code rather than the infrastructure. Encourage your data scientists to write efficient code (e.g., use vectorized operations or built-in Spark functions instead of heavy row-by-row Python UDFs where possible; see the pandas UDF sketch after this list) – faster code means less cluster time and cost for the same task. Use Photon (Databricks’s optimized engine) for SQL and DataFrame operations; it can significantly speed up workloads, and it’s enabled by default in Serverless SQL and as an option in clusters (Best practices for cost optimization – Azure Databricks | Microsoft Learn). Although Photon might have a slightly higher DBU rate, if it cuts query time in half, you come out ahead in cost. In one Reddit discussion, a user mentioned evaluating Photon – it didn’t help their specific jobs, but it’s worth testing for your workload (How did you reduce your Databricks costs ? : r/dataengineering). On the development side, adopt CI/CD and local testing. By using tools like Databricks Asset Bundles or dbx, you can run many tests locally or in a lightweight environment, and only use a cluster for full integration tests and production runs. This means fewer ad-hoc experiments on expensive clusters. One team shared that by moving development to an IDE with dbx and using short-lived job clusters, they avoided the cost of interactive clusters completely. The trade-off here is investing time in tooling and possibly training the team on new workflows, but it pays off through a more controlled and efficient use of resources.
  • Storage and data management – Treat data as a first-class citizen in cost optimization. Partition and compact your data so that jobs don’t waste time on excessive I/O. Use Delta Lake and regularly OPTIMIZE tables – this can reduce query runtimes dramatically (we’ve seen 2–3× faster in many cases after compaction), directly reducing compute hours needed. Yes, there’s a cost to running the OPTIMIZE command itself, but think of it as paying a small maintenance fee to avoid a large performance penalty on every query. Also, archive cold data to cheaper storage. If you have years of logs or old feature tables that data scientists rarely access, keep only the last few months in the active workspace and move the rest to Archive storage (or even to a lower-cost data warehouse for long-term storage). You could save on the order of 50-80% of storage costs for that data, albeit with the understanding that retrieving it is slow (which is fine if it’s rarely needed). Finally, don’t overlook cost governance: implement tagging on Databricks resources (jobs, clusters) to track which projects are consuming what (Best practices for cost optimization – Azure Databricks | Microsoft Learn). This isn’t a direct savings, but it helps identify hotspots to optimize. For example, you might discover one machine learning notebook is using 60% of the cost – that’s a signal to dig in and see if it can be optimized or run on a cheaper instance. Regularly review cost reports (Azure Cost Management or Databricks cost reports) to ensure the optimizations you implement are actually yielding results and to catch any regressions early.
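To illustrate the UDF point from the list above, here is a small sketch comparing a row-at-a-time Python UDF with a vectorized pandas UDF implementing the same (hypothetical) discount logic. The pandas version processes whole Arrow batches, which usually means noticeably fewer cluster-seconds for the same result:

```python
# Sketch: replacing a row-at-a-time Python UDF with a vectorized pandas UDF.
# The discount logic is a made-up example; the pattern is what matters.
import pandas as pd
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import DoubleType

@udf(DoubleType())
def slow_discount(price: float) -> float:          # invoked once per row
    return price * 0.9 if price and price > 100 else price

@pandas_udf(DoubleType())
def fast_discount(price: pd.Series) -> pd.Series:  # invoked once per Arrow batch
    return price.where(price <= 100, price * 0.9)

# Usage (drop-in replacement on an existing DataFrame `df` with a `price` column):
# df = df.withColumn("net_price", fast_discount("price"))
```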

By following these best practices, a data engineering & data science team can significantly reduce Databricks costs while still meeting their workload requirements. In many cases, organizations have achieved on the order of 30-50% cost savings (or more) by eliminating waste (idle time, over-provisioned clusters) and leveraging cheaper compute options (How did you reduce your Databricks costs ? : r/dataengineering) (How did you reduce your Databricks costs ? : r/dataengineering). The trade-offs usually involve a bit more planning and automation – e.g., setting up jobs and schedules instead of manual runs, or coordinating shared resources – but the benefit is a more sustainable and scalable use of Databricks. Always remember: cost optimization is an ongoing process. Continuously monitor usage, experiment with new features (like serverless or newer instance types), and refine your approach as workloads evolve. With discipline and the techniques outlined above, you can keep your Databricks environment efficient and cost-effective, enabling your team to focus on data tasks without breaking the budget.

Sources: Best practices for Azure Databricks cost optimization (Best practices for cost optimization – Azure Databricks | Microsoft Learn) (Best practices for cost optimization – Azure Databricks | Microsoft Learn) (Best practices for cost optimization – Azure Databricks | Microsoft Learn) (Best practices for cost optimization – Azure Databricks | Microsoft Learn); Databricks community and blog insights on pools, serverless, and spot instances (Serverless vs pool : r/databricks) (Serverless vs pool : r/databricks) (Azure Spot Instances for Databricks AI | Databricks Blog) (Azure Spot Instances for Databricks AI | Databricks Blog); Recommendations on cluster sharing and autoscaling (Cost Breakdown of Databricks by job or user – Microsoft Q&A) (Databricks Cost Optimization: A Practical Guide | overcast blog); Guidance on Databricks Asset Bundles and local development (Simplify your workflow deployment with Databricks Asset Bundles: Part I – Xebia) (Local Development using Databricks Clusters – Pivotal BI); Delta Lake and storage optimization tips (Databricks Cost Optimization: A Practical Guide | overcast blog) (Azure Cost Optimisation: 12 Quick Tips for a Lower Bill).

Prompts and Context For the LLM

Requirements:

  • The workload is based on batch processing in the shape of a micro-batching setup. In most cases the processing is done using UDFs.
  • Research different types of instance optimization.
  • Consumption pattern is mostly daily during European night time.
  • No restriction on the instance type; just trying to explore and research the possibilities, from serverless to dedicated VMs on Azure. Explore spot instances mixed with autoscaling config and see how the job execution can be handled.
  • Not only compute: also explore optimization of storage size, with a focus on cloud storage such as ADLS Gen2 (abfs) + DBFS.

Context:

The workspace will be used by data engineers and data scientists; the data scientists will be running notebooks most of the time, while the data engineers will be dealing with workflows and job runs. It is also a good idea to explore the use of Databricks Asset Bundles to run things locally and perhaps to deploy to the workspace.

These workloads are the ones that usually consume the most money, since there is typically one personal instance set up for each user. It would be great to explore shared possibilities for these setups, or the use of local (non-cloud) compute to run things while developing, including Databricks Connect. Also, UC (Delta Sharing) can be used to download subsets of data to run some tests against data stored on the instance.
