Ideally, all enterprise data and analytics platforms would be consolidated with a single cloud vendor while meeting both IT and business needs. In reality, most enterprises live with a complex web of legacy and modern technologies and are still on a path toward that ideal target state.
Below are some multi-cloud (and multi-vendor) pitfalls to watch out for, and some best practices to consider as you develop and evolve your platform strategy.
Fallacies of Distributed Computing
Peter Deutsch and other fellows at Sun Microsystems first penned the "8 Fallacies" back in 1994:
- The network is reliable;
- Latency is zero;
- Bandwidth is infinite;
- The network is secure;
- Topology doesn't change;
- There is one administrator;
- Transport cost is zero;
- The network is homogeneous.
When it comes to reasons for caution in multi-cloud platform architecture, the 2nd, 3rd and 7th fallacies remain harrowingly relevant today ("Latency is zero", "Bandwidth is infinite" and "Transport cost is zero", respectively).
Latency is Zero
Moving data across cloud platforms is slow and introduces additional latency. This reduces teams' agility and makes certain use cases (such as real-time decisioning) impractical without costly duplication of systems or other workarounds. Although such workarounds exist, they introduce additional complexity, maintenance costs, and points of failure.
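As a rough illustration of why real-time use cases suffer, the sketch below subtracts cross-cloud round trips from a fixed response-time budget. The function name and all of the millisecond figures are hypothetical and chosen only to show the arithmetic; measure your own round-trip times before drawing conclusions.

```python
def remaining_budget_ms(
    budget_ms: float,
    processing_ms: float,
    cross_cloud_round_trips: int,
    round_trip_ms: float,
) -> float:
    """Return the time left in a latency budget after processing and
    cross-cloud round trips. A negative result means the budget is blown."""
    if cross_cloud_round_trips < 0:
        raise ValueError("cross_cloud_round_trips must not be negative")
    return budget_ms - processing_ms - cross_cloud_round_trips * round_trip_ms


if __name__ == "__main__":
    # Hypothetical figures: 100 ms budget, 40 ms of local processing,
    # 60 ms per cross-cloud round trip.
    for trips in (0, 1, 2):
        left = remaining_budget_ms(100, 40, trips, 60)
        print(f"{trips} cross-cloud round trip(s): {left:+.0f} ms remaining")
```

With these assumed numbers, a single cross-cloud round trip consumes the entire remaining budget, and a second one overshoots it.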
When your organization's circumstances force data to be stored across multiple platforms, consider using a federated query feature such as Databricks' Lakehouse Federation. Although the latency itself is unavoidable, centralized catalog tools like Unity Catalog give you the opportunity to avoid the worst data duplication and maintenance pitfalls.
Bandwidth is Infinite
The ability to copy data for later transformation to suit different purposes is not unlimited. Ideally, work should be performed where the data resides, and the data should be stored in open formats so that diverse tools have unimpeded access. For example, open standards like Parquet files on the Databricks Lakehouse would be preferred over proprietary columnar storage formats like those found on Snowflake.
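To make the bandwidth limit concrete, here is a minimal sketch that estimates how long a bulk copy would take over a given link. The function name, the link speed, and the 80% efficiency factor are illustrative assumptions, not measurements of any particular provider or network.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move data_tb terabytes (decimal, 10^12 bytes)
    over a link of link_gbps gigabits per second.

    efficiency is an assumed fraction of the nominal link speed that is
    actually achieved after protocol overhead and contention.
    """
    if data_tb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid size, link speed, or efficiency")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical scenario: 10 TB over a 1 Gbps link at 80% efficiency.
    print(f"{transfer_hours(10, 1):.1f} hours")
```

Under these assumptions, copying 10 TB takes a little under 28 hours, which is why performing the work where the data already resides is usually preferable to moving it.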
Transport Cost is Zero
Getting data into (ingress) and out of (egress) cloud platforms is an often overlooked cost when evaluating cloud vendors and potential architecture options. Although it is possible to create secure VPN connections between cloud platforms, this is often a cost-prohibitive strategy.
At the time of writing, internet egress costs for the top cloud hyperscalers range from roughly a dime to nearly a quarter per gigabyte, depending on the region:
- Microsoft Azure ($0.087 to $0.181)
- AWS ($0.09)
- Google Cloud Platform ($0.12 to $0.23)
With single departments within the enterprise generating terabytes of data daily, it can quickly become impractical to move or transfer even a modest set of historical data between cloud platforms without incurring significant costs.
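The arithmetic behind that claim can be sketched using the per-gigabyte rates listed above. This is a simplified estimate: it applies a flat rate to the whole volume and ignores the tiered pricing, free allowances, and negotiated discounts that real bills include. The function and dictionary names are hypothetical.

```python
# Internet egress rates in USD per GB (low, high), taken from the list above.
EGRESS_USD_PER_GB = {
    "Microsoft Azure": (0.087, 0.181),
    "AWS": (0.09, 0.09),
    "Google Cloud Platform": (0.12, 0.23),
}


def egress_cost_range(data_tb: float, provider: str) -> tuple[float, float]:
    """Return the (low, high) flat-rate egress cost in USD for moving
    data_tb terabytes (1 TB = 1,000 GB) out of the given provider."""
    if data_tb < 0:
        raise ValueError("data_tb must not be negative")
    low_rate, high_rate = EGRESS_USD_PER_GB[provider]
    gigabytes = data_tb * 1000
    return gigabytes * low_rate, gigabytes * high_rate


if __name__ == "__main__":
    # Hypothetical scenario: moving 10 TB of historical data.
    for provider in EGRESS_USD_PER_GB:
        low, high = egress_cost_range(10, provider)
        print(f"{provider}: ${low:,.0f} to ${high:,.0f}")
```

At these flat rates, a one-off move of 10 TB costs roughly $900 to $2,300 depending on the provider and region, and a department that transfers a terabyte every day would pay on the order of $90 to $230 per day.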
While some stakeholders may be enchanted by a particular product that is only available on, or runs best on, a particular cloud environment, the egress costs should not be underestimated. Even if advanced APIs and connectors are available, they may not matter if data transfer costs start to balloon your IT budget.