Designing Colocation for High‑Density AI: A Practical Checklist for Power, Cooling, and Connectivity
A practical checklist for AI colocation: map MW, rack kW, cooling, and carrier choices to avoid costly deployment mistakes.
AI colocation is no longer a generic facilities exercise. For engineering and DevOps teams, it is an infrastructure design problem with hard constraints: megawatts available now, rack density that can exceed 100 kW, cooling architectures that can remove extreme heat, and network paths that keep distributed training and inference stable under load. If you choose the wrong site, the symptoms show up fast: delayed deployment, throttled GPUs, overloaded electrical gear, noisy alerts, and costly redesigns after the contract is signed. This guide turns those risks into a concrete colocation checklist you can use to evaluate sites, align procurement, and avoid the most common mistakes in AI infrastructure.
As AI accelerators become denser, the old rules of thumb break down. A room that was fine for general-purpose servers may fail at the breaker level, the cooling loop, or the fiber meet-me room. This is why the planning process should be anchored in real workload numbers, not vague capacity claims; a provider’s “future expansion” only helps if your deployment timeline matches it. For a broader view of how AI infrastructure is changing, see our foundational piece on next-generation AI infrastructure, which emphasizes immediate power, liquid cooling, and strategic location.
1) Start with the workload, not the building
Translate model plans into power and rack density
The first mistake teams make is asking, “Which data center has space?” when the real question is, “What does our AI cluster require at full utilization?” Begin with GPU count, node count, storage footprint, and target rack layout, then convert that into per-rack kW and total facility draw. For example, a two-rack pilot using high-end accelerator servers can consume more than a traditional row’s worth of power, while a production training pod can quickly move into multi-megawatt territory. A practical approach is to define three numbers before any site visit: rack kW, pod MW, and growth envelope over 12–24 months.
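To make that concrete, here is a minimal sizing sketch in Python. Every per-server figure is an illustrative assumption, not vendor data; substitute your own nameplate and measured draw before using the output in a capacity plan.

```python
# Minimal sizing sketch: turn a hardware plan into the three numbers to fix
# before any site visit. All per-server figures are illustrative assumptions;
# substitute your vendor's nameplate and measured draw.

SERVER_KW = 10.2          # assumed draw of one 8-GPU accelerator server, kW
SERVERS_PER_RACK = 4      # assumed power/space limit per rack
OVERHEAD_KW = 6.0         # assumed per-rack switches, storage, and management

def rack_kw() -> float:
    """Peak kW for one rack, including non-compute overhead."""
    return SERVERS_PER_RACK * SERVER_KW + OVERHEAD_KW

def pod_mw(racks: int) -> float:
    """Total IT load in MW for a pod of identical racks."""
    return racks * rack_kw() / 1000.0

def growth_envelope_mw(racks_now: int, quarterly_growth: float, quarters: int) -> float:
    """Projected pod MW after compounding rack-count growth."""
    return pod_mw(round(racks_now * (1 + quarterly_growth) ** quarters))

print(f"rack kW:           {rack_kw():.1f}")                           # 46.8
print(f"pod MW (16 racks): {pod_mw(16):.2f}")                          # 0.75
print(f"24-month envelope: {growth_envelope_mw(16, 0.25, 8):.2f} MW")  # 4.45
```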
Do not forget networking and storage when estimating density. AI clusters need more than raw compute: they also require high-throughput east-west fabrics, out-of-band management, and fast access to datasets. A design that underestimates storage or oversubscribes network fabrics can create artificial bottlenecks that look like GPU inefficiency but are actually infrastructure constraints. That is why teams should document workload assumptions in a shared capacity plan, similar to how procurement teams use a market-research-to-capacity-plan workflow before committing to a site.
Set deployment phases and acceptance thresholds
A good colocation plan is phased. Phase one should be a minimum viable AI pod that can validate power quality, cooling behavior, and network latency in production conditions. Phase two should be capacity that is already engineered into the same location, not a hope on a slide deck. Phase three should define what happens if the workload outgrows the first site, including whether the organization will replicate, expand, or move to a second carrier-neutral facility. Planning this way prevents the team from treating AI infrastructure like a one-time purchase instead of an operational lifecycle.
Write acceptance thresholds for each phase. For example, specify the maximum rack inlet temperature, acceptable PUE targets, required redundant power paths, and the latency budget to your primary cloud regions. If you already manage other critical operations, use the same discipline you would use for incident runbooks and vendor evaluation, much like the structure in vendor evaluation frameworks that force concrete questions instead of generic promises. The goal is to make “ready for AI” a measurable statement.
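As a sketch of what “measurable” can look like, the snippet below encodes phase-one acceptance thresholds as explicit pass/fail checks. Every value here is a placeholder; set the real numbers from your hardware limits and SLAs.

```python
# A sketch of making "ready for AI" measurable: phase-one acceptance
# thresholds as explicit pass/fail checks. Every value is a placeholder;
# set them from your own hardware limits and SLAs.

THRESHOLDS = {
    "max_rack_inlet_c": 32.0,   # inlet limit for your hardware class
    "max_pue": 1.35,            # efficiency target at representative load
    "min_power_paths": 2,       # independent feeds to every rack
    "max_cloud_rtt_ms": 8.0,    # latency budget to the primary cloud region
}

def failed_checks(measured: dict) -> list[str]:
    """Return the acceptance checks that failed (empty list means pass)."""
    failures = []
    if measured["rack_inlet_c"] > THRESHOLDS["max_rack_inlet_c"]:
        failures.append("rack inlet temperature over limit")
    if measured["pue"] > THRESHOLDS["max_pue"]:
        failures.append("PUE above target")
    if measured["power_paths"] < THRESHOLDS["min_power_paths"]:
        failures.append("insufficient redundant power paths")
    if measured["cloud_rtt_ms"] > THRESHOLDS["max_cloud_rtt_ms"]:
        failures.append("cloud latency over budget")
    return failures

print(failed_checks({"rack_inlet_c": 29.5, "pue": 1.42,
                     "power_paths": 2, "cloud_rtt_ms": 6.1}))
# ['PUE above target']
```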
Checklist items for workload definition
Before you talk to colocation providers, capture the following in writing: peak and average rack kW, total IT load in MW, desired growth rate, acceptable downtime window, cooling preference, network latency requirements, and any data residency or compliance constraints. If you cannot quantify the workload, providers will fill in the blanks with assumptions that may not match reality. This is especially dangerous for AI because procurement lead times for transformers, switchgear, and liquid cooling are much longer than the time it takes to sign a lease. The earlier the requirements are locked, the lower the chance of expensive rework.
2) Power is the first gate: MW, redundancy, and delivery timing
Separate IT load from facility load
One of the most common procurement mistakes is confusing IT load with total facility draw. Your cluster may need 1 MW of usable IT capacity, but the building-level requirement will be higher once cooling, distribution losses, and redundancy are added. That distinction matters because colocation pricing, generator sizing, and utility planning depend on different layers of the power stack. Ask providers whether they quote usable critical load, full building load, or only nameplate utility intake.
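The back-of-envelope sketch below shows why the layers diverge, assuming an illustrative PUE of 1.3 and a 1.2x utility headroom factor; your site’s real multipliers will differ.

```python
# Back-of-envelope sketch of the layers a power quote can refer to.
# The PUE and headroom multipliers are assumptions for illustration.

def facility_load_mw(it_load_mw: float, pue: float = 1.3) -> float:
    """Building-level draw once cooling and distribution losses are added."""
    return it_load_mw * pue

def utility_intake_mw(facility_mw: float, headroom: float = 1.2) -> float:
    """Utility-side requirement with headroom for redundancy and growth."""
    return facility_mw * headroom

it_mw = 1.0                           # usable critical IT load you actually need
fac_mw = facility_load_mw(it_mw)
print(f"facility load:  {fac_mw:.2f} MW")                     # 1.30
print(f"utility intake: {utility_intake_mw(fac_mw):.2f} MW")  # 1.56
# A provider quoting "1.5 MW" could mean any of these layers; always ask which.
```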
For AI, “future megawatts” are less valuable than immediate megawatts. A site with a 12-month buildout plan might still be a dead end if your cluster needs to arrive this quarter. This is why the market is shifting toward ready-now capacity and why operators increasingly treat power as a deployment dependency, not a facilities feature. If your team is comparing capacity windows, the logic in choosing compute under supply constraints translates well to colocation: prioritize what is actually available, not just advertised.
Choose redundancy based on the workload’s tolerance for interruption
Not every AI workload needs identical redundancy, but every workload needs an explicit choice. Training jobs often tolerate rescheduling better than low-latency inference, while shared feature stores and orchestration layers can become single points of failure if underdesigned. Decide whether you need N, N+1, 2N, or a segmented architecture with different tiers of resilience. Then map that choice to utility feeds, UPS topologies, generator runtime, and maintenance windows.
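One way to start that mapping is a deliberately simplified decision helper like the sketch below. Real designs also weigh utility feeds, UPS topology, and generator runtime, so treat this as a conversation starter, not an engineering answer.

```python
# A deliberately simplified decision helper: map interruption tolerance to a
# starting redundancy topology. Real designs also weigh utility feeds, UPS
# topology, and generator runtime; treat this as a conversation starter.

def redundancy_tier(checkpointed: bool, latency_sensitive: bool,
                    shared_dependency: bool) -> str:
    if shared_dependency:
        return "2N"    # feature stores, orchestration: no single point of failure
    if latency_sensitive:
        return "N+1"   # inference: ride through one component failure
    if checkpointed:
        return "N"     # training that reschedules cleanly can accept leaner power
    return "N+1"

print(redundancy_tier(checkpointed=True, latency_sensitive=False,
                      shared_dependency=False))   # N
```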
Do not assume more redundancy is always better. At very high densities, overbuilding the wrong layer can inflate cost without solving the actual risk, especially when the larger issue is utility arrival timing or switchgear lead time. In practice, teams should ask whether the provider can support maintenance without taking an AI pod offline, whether there are separate electrical rooms, and whether the facility has proof of successful high-density operations. A disciplined procurement team will also review contract terms with the same seriousness they would use for vendor checklists for AI tools, because uptime claims are only as good as the service credits and operational commitments behind them.
Power questions to ask every provider
Ask these questions in every sales call: What is the maximum kW per cabinet today? What is the maximum per row and per suite? Is the power already live or only planned? What are the utility interconnect timelines? What electrical topology is standard, and what changes for AI pods? Can the site support stepwise growth from a few hundred kW to multiple MW without a migration? If the provider cannot answer these in concrete numbers, you do not yet have a usable deployment target.
3) Cooling strategy: air, DLC, and rear-door options
Know when air cooling stops being enough
Traditional air cooling can support many enterprise workloads, but AI cluster design quickly pushes beyond what standard rows can handle efficiently. Once rack density climbs into the high double digits of kW or beyond, hot spots, airflow short-circuiting, and fan power overhead become serious constraints. This is where teams must choose between enhanced air, direct-to-chip liquid cooling, or hybrid approaches. The right answer depends on power density, hardware roadmap, and whether the colocation provider can integrate liquid loops safely.
High-density AI environments also benefit from the operational simplicity of standardized cooling tiers. If your first cluster uses air-assisted cooling but the second cluster will require liquid, design the facility decision around the second cluster, not the first. In other words, do not buy your way into a dead end. A single site that cannot progress from 20 kW to 60 kW to 100 kW per rack will force an expensive second migration just as your AI program gains momentum.
Compare direct-to-chip liquid cooling and rear-door heat exchangers
Direct-to-chip liquid cooling (DLC) removes heat closer to the source and is increasingly attractive for the hottest AI deployments. It can support very high densities with better thermal efficiency, but it also introduces new requirements: coolant distribution units, leak detection, water quality management, and maintenance procedures that technicians must understand. Rear-door heat exchangers (RDHx) are often easier to adopt because they attach to the rack and can relieve heat without redesigning every server node, making them a practical bridge for teams moving from air to liquid-assisted infrastructure.
RDHx is not a universal answer. It depends on rack geometry, return air assumptions, and how much heat the facility can reject through the water loop. DLC is more scalable for next-generation GPU platforms, while RDHx may be easier for transitional deployments or mixed environments. The safest approach is to ask whether the colocation site supports both, and whether its mechanical design can handle your foreseeable cooling roadmap rather than only your first purchase order. For broader operational resilience planning, teams often borrow playbook thinking from shipping exception playbooks: define the failure modes before they happen.
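The loop-sizing arithmetic behind these questions is straightforward: the heat a water loop removes is Q = ṁ × c_p × ΔT. The worked example below uses illustrative rack loads and an assumed 10 K temperature rise across the loop.

```python
# Worked example of liquid cooling loop sizing: heat removed is
# Q = m_dot * c_p * delta_T. Rack loads and the 10 K rise are illustrative.

WATER_CP = 4186.0    # specific heat of water, J/(kg*K)
KG_PER_LITER = 1.0   # water density, close enough for facility loops

def required_flow_lpm(rack_kw: float, delta_t_k: float) -> float:
    """Coolant flow in liters/minute to remove rack_kw at a given loop delta-T."""
    mass_flow_kg_s = (rack_kw * 1000.0) / (WATER_CP * delta_t_k)
    return mass_flow_kg_s / KG_PER_LITER * 60.0

for kw in (40, 80, 120):
    print(f"{kw:>3} kW rack at 10 K delta-T: {required_flow_lpm(kw, 10):.0f} L/min")
#  40 kW rack at 10 K delta-T: 57 L/min
#  80 kW rack at 10 K delta-T: 115 L/min
# 120 kW rack at 10 K delta-T: 172 L/min
```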
Checklist for cooling validation during site tours
When you tour a site, inspect more than the brochure. Ask to see real deployment references at similar densities, maintenance access to pumps and manifolds, and the provider’s process for isolating a loop during service. Confirm whether chilled water, warm water, or direct liquid loops are supported, and how the site monitors flow, temperature, and leak events. Finally, ask how the provider handles mixed rows, because many AI programs need a combination of legacy servers, storage, and liquid-cooled accelerators during the transition period.
4) Connectivity: carrier neutral, latency, and routing design
Why carrier neutrality matters for AI operations
For AI colocation, network design is not just about internet bandwidth. It is about how efficiently you can move training data, replicate artifacts, access cloud services, and fail over between regions or providers. A carrier-neutral facility gives you more leverage because you can choose the right mix of transit, cloud on-ramps, and private connectivity instead of being locked into a single network path. That flexibility becomes especially important when you are balancing cost, performance, and vendor independence across a multi-cloud architecture.
Carrier neutrality also gives your team room to optimize latency and resilience. If one provider has better routes to a cloud region, a research partner, or an enterprise campus, you can adjust interconnects without relocating the cluster. For teams that already think in terms of supply chain resilience and optionality, the logic is similar to supply chain continuity planning: do not depend on a single path when multiple viable ones exist.
Define the latency budget before choosing the metro
Many teams focus too much on bandwidth and too little on latency. AI training across distributed sites, model serving close to users, and storage replication all suffer when the metro choice adds avoidable delay. Decide where the work must happen: close to users, close to cloud on-ramps, or close to a specific data source. Then use that latency budget to filter metros before you evaluate building-specific features.
Practical rule: if your workflow depends on cloud burst capacity, choose a site with low-latency connectivity to your primary hyperscalers and enough carrier diversity to keep routing stable. If your inference workload is customer-facing, prioritize edge proximity and consistent jitter. If your training cluster needs to exchange data with another site, consider private interconnects rather than relying solely on public internet transit. This decision framework resembles the discipline behind platform acquisition strategy: location decisions should support the operating model, not the other way around.
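Before a site visit, you can sanity-check a metro against your latency budget with a rough probe like the one below, which times TCP handshakes from a test host to the endpoints that matter. The hostnames and the budget value are placeholders for your own cloud on-ramps, data sources, or partner sites.

```python
# Rough latency probe: time TCP handshakes from a test host in the candidate
# facility to the endpoints that matter. Hostnames and the budget are
# placeholders for your own cloud on-ramps, data sources, or partner sites.

import socket
import statistics
import time

def tcp_rtt_ms(host: str, port: int = 443, samples: int = 10) -> float:
    """Median TCP connect time in milliseconds, a rough proxy for RTT."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            times.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(times)

LATENCY_BUDGET_MS = 8.0   # example budget to the primary cloud region
for endpoint in ("cloud-onramp.example.com", "data-source.example.com"):
    rtt = tcp_rtt_ms(endpoint)
    verdict = "ok" if rtt <= LATENCY_BUDGET_MS else "over budget"
    print(f"{endpoint}: {rtt:.1f} ms ({verdict})")
```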
Network checklist items
Ask every colocation provider: Which carriers are physically present? Are cloud on-ramps available? How many diverse fiber entry points exist? Can you order cross-connects quickly? Is the facility carrier neutral in practice, or only in marketing language? Are private interconnects and direct cloud connections supported from the same suite? The correct answers should be specific enough to map directly into your network design and procurement timeline.
5) Build a comparison matrix before signing any contract
Use a scoring table to compare sites side by side. That way, you can make the tradeoffs visible instead of relying on impressions from a polished tour. The most important comparison dimensions for AI colocation are power readiness, cooling compatibility, rack density support, connectivity, and commercial flexibility. Add compliance, expansion rights, and service response time if the deployment is business-critical.
| Evaluation Factor | What to Verify | Why It Matters for AI |
|---|---|---|
| Immediate MW availability | Live utility-backed capacity, not only planned expansion | AI deployment schedules are often blocked by power lead times |
| Rack density support | Verified kW per rack and per row | High-density GPU racks can exceed legacy thresholds |
| Cooling compatibility | Air, DLC, RDHx, or hybrid support | Prevents thermal throttling and future migrations |
| Carrier neutrality | Multiple carriers, cloud on-ramps, diverse fiber paths | Improves latency, resilience, and pricing leverage |
| Deployment flexibility | Space for phased growth and mixed hardware generations | Reduces stranded assets as AI hardware evolves |
| Operational visibility | Sensors, dashboards, and incident response processes | Enables reliable operations and faster troubleshooting |
This kind of matrix is useful because AI infrastructure decisions are usually made under uncertainty. Teams are comparing energy availability, thermal readiness, commercial terms, and network design simultaneously, often with incomplete information. A weighted scorecard turns the discussion into a repeatable process, which helps engineering, finance, and procurement align on the same facts. For teams that like operational dashboards, the idea is similar to financial-style monitoring: convert complexity into clear decision signals.
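A minimal version of that scorecard, matching the table above, might look like the sketch below. The weights and the 0-to-5 scores are illustrative; agree on them with engineering, finance, and procurement before the first tour so the comparison stays repeatable.

```python
# Minimal weighted-scorecard sketch matching the table above. Weights and
# scores (0 to 5) are illustrative; agree on them across teams before touring.

WEIGHTS = {
    "immediate_mw":  0.25,
    "rack_density":  0.20,
    "cooling":       0.20,
    "connectivity":  0.15,
    "flexibility":   0.10,
    "observability": 0.10,
}

def site_score(scores: dict) -> float:
    """Weighted total for one site."""
    return sum(WEIGHTS[factor] * scores[factor] for factor in WEIGHTS)

site_a = {"immediate_mw": 5, "rack_density": 4, "cooling": 3,
          "connectivity": 5, "flexibility": 3, "observability": 4}
site_b = {"immediate_mw": 2, "rack_density": 5, "cooling": 5,
          "connectivity": 3, "flexibility": 4, "observability": 3}

print(f"Site A: {site_score(site_a):.2f}")   # 4.10
print(f"Site B: {site_score(site_b):.2f}")   # 3.65; ready-now power dominates
```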
6) Procurement mistakes that quietly break AI deployments
Buying today’s cluster for tomorrow’s hardware
One of the easiest mistakes is procuring only for the first generation of GPUs. AI hardware moves quickly, and thermal and power characteristics can change materially from one platform to the next. If your facility can only support a narrow band of densities, your next procurement cycle may force a new site search even though the original contract is still active. That is why procurement should ask not only “Can this site hold my current build?” but “What does this site look like when the next accelerator generation lands?”
Teams also underestimate the operational complexity of mixed estates. A colocation environment might need to support legacy servers, storage nodes, management plane equipment, and liquid-cooled compute all at once. The facility must allow staged migration without creating outages or thermal conflicts. If your organization manages change well, you already know this is the same reason why change announcements need structure: ambiguity creates downstream friction.
Ignoring lead times for electrical and mechanical gear
AI projects often slip because teams focus on rack delivery but overlook switchgear, transformers, pumps, CDU capacity, and fiber installation. These items can have longer lead times than servers, and in some markets they are the real schedule risk. The practical fix is to build a dependency map with the longest-lead items on the critical path, then use that map in weekly procurement reviews. If the provider cannot give realistic dates for every enabling component, treat the site as not yet ready.
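A dependency map does not need heavy tooling. The sketch below finds the component that actually gates go-live by taking the longest lead-time path; the items, durations, and dependencies are illustrative assumptions, not market data.

```python
# Dependency-map sketch: find the component that gates go-live by taking the
# longest lead-time path. Items, durations (weeks), and dependencies are
# illustrative assumptions.

LEAD_WEEKS = {
    "utility_interconnect": 40,
    "switchgear": 30,
    "transformers": 26,
    "cdu_install": 12,
    "fiber_entry": 10,
    "racks_and_servers": 8,
}
DEPENDS_ON = {
    "switchgear": ["utility_interconnect"],
    "transformers": ["utility_interconnect"],
    "cdu_install": ["switchgear"],
    "racks_and_servers": ["cdu_install", "fiber_entry", "transformers"],
}

def finish_week(item: str) -> int:
    """Earliest finish: the item's lead time after all prerequisites land."""
    prereqs = DEPENDS_ON.get(item, [])
    return LEAD_WEEKS[item] + max((finish_week(p) for p in prereqs), default=0)

print(f"earliest go-live: week {finish_week('racks_and_servers')}")
# week 90, gated by the utility-to-switchgear-to-CDU chain, not by the servers
```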
Another common failure is assuming the contract will automatically guarantee the necessary build-out sequence. Ask for milestone-based commitments, acceptance testing criteria, and remedies if the facility misses dates. This protects you against “capacity promised later” language that sounds attractive during sales but fails during execution. Think of it like the rigor required in secure support desk design: process and evidence matter more than optimistic claims.
What to lock into the contract
At minimum, lock in usable kW per rack, total committed MW, cooling compatibility, cross-connect pricing, carrier access expectations, maintenance windows, and expansion rights. Include a service escalation path and a documented process for facility incidents that affect the AI pod. If the provider is willing, add deployment-specific acceptance tests that verify temperature stability, power quality, and network handoff before full production. The more concrete the language, the less likely your “AI-ready” site becomes a general-purpose building with a premium price tag.
7) Operational readiness: monitoring, runbooks, and incident response
Design observability around the physical stack
AI operations teams should monitor more than application logs. The infrastructure stack needs telemetry from power feeds, PDUs, rack sensors, coolant loops, network interfaces, and facility alarms. If you can only see GPU utilization but not inlet temperatures or power excursions, you will miss the causes of throttling before they become outages. A unified view is essential because the fastest path to resolution is often correlating physical changes with workload behavior.
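The sketch below illustrates that correlation: flag the windows where GPU throttling coincides with a rack inlet temperature excursion, so a “slow training” ticket points at the facility rather than the software. The field names and threshold are assumptions for illustration.

```python
# Sketch of the physical-to-workload correlation: flag windows where GPU
# throttling coincides with a rack inlet temperature excursion. Field names
# and the threshold are assumptions for illustration.

INLET_LIMIT_C = 32.0

def thermal_throttle_windows(samples: list[dict]) -> list[str]:
    """Timestamps where throttling and a thermal excursion overlap."""
    return [s["ts"] for s in samples
            if s["gpu_throttled"] and s["rack_inlet_c"] > INLET_LIMIT_C]

window = [
    {"ts": "12:00", "gpu_throttled": False, "rack_inlet_c": 28.1},
    {"ts": "12:05", "gpu_throttled": True,  "rack_inlet_c": 33.4},
    {"ts": "12:10", "gpu_throttled": True,  "rack_inlet_c": 34.0},
]
print(thermal_throttle_windows(window))
# ['12:05', '12:10']: a facility problem, not a software problem
```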
This is where cross-team dashboard design pays off. Engineering, SRE, and facilities teams should share a common vocabulary for events such as power derates, loop temperature shifts, carrier outages, and thermal alarms. If your team already uses analytics-centric operating models, the approach mirrors heatmap-driven performance analysis: the right signals expose where the bottleneck really lives.
Write AI-specific runbooks before go-live
Runbooks for AI colocation should cover power loss, partial cooling failure, network degradation, and maintenance-mode transitions. They should also define what happens when a training job is paused, resumed, or restarted, because the application layer may need special handling after an infrastructure event. Put ownership in writing: who talks to the carrier, who checks the CDU, who validates job state, and who decides whether to fail over. Without clear ownership, facility events become multi-team coordination problems at the worst possible moment.
It is also wise to rehearse the first 30 minutes of an incident. That window determines whether the team will preserve data, avoid cascading failures, and communicate accurately to stakeholders. Treat the drill as seriously as any production release. In high-stakes environments, disciplined rehearsal is often the difference between a recoverable event and a costly outage, a lesson that aligns with the broader operational rigor discussed in infrastructure excellence frameworks.
Metrics that matter after cutover
After cutover, watch a small number of metrics closely: sustained rack power, thermal headroom, loop stability, packet loss, jitter, and job completion consistency. If these trend in the wrong direction, the issue may not be software at all; it may be the physical environment under load. Set alert thresholds conservatively during the first 30 to 60 days and tune them after you have enough baseline data. This reduces alert noise while still protecting the deployment from hidden instability.
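One common tuning approach after the baseline window is to set each alert threshold at the observed mean plus a few standard deviations, as in the sketch below; the k value and the sample data are illustrative.

```python
# One tuning approach after the baseline window: set each alert threshold at
# the observed mean plus k standard deviations. The k value and the sample
# data are illustrative.

import statistics

def baseline_threshold(samples: list[float], k: float = 3.0) -> float:
    """Alert threshold from baseline data: mean plus k sigma."""
    return statistics.mean(samples) + k * statistics.stdev(samples)

# daily peak rack power (kW) observed during the first weeks after cutover
baseline_kw = [43.1, 44.0, 42.7, 43.8, 44.5, 43.3, 44.1]
print(f"alert above {baseline_threshold(baseline_kw):.1f} kW")
# Start with a large k to cut noise, then tighten as the baseline firms up.
```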
8) A practical colocation checklist for AI teams
Pre-sales checklist
Before site selection, require the provider to answer these questions in writing: What MW are live today? What is the maximum rack density now? Which cooling modes are supported? Which carriers and cloud on-ramps are physically present? What are the lead times for electrical and mechanical expansions? What proof do they have of similar AI deployments? Written answers help eliminate ambiguity and provide a clean record for internal review.
Also ask for reference architectures and references from customers with similar density profiles. If possible, request a technical session with the operations team, not just the sales team. The people who maintain the facility can tell you whether the infrastructure is truly designed for AI or merely adapted for it. That distinction is often visible in the details, from manifold design to cross-connect workflow to incident escalation.
Deployment checklist
Once the site is selected, confirm the delivery sequence for racks, servers, network gear, and cooling interfaces. Validate the electrical plan with one-line diagrams and verify that every phase of the deployment has an acceptance test. Coordinate with carriers early so that physical circuits, transceivers, and private links are ready when your equipment arrives. If the facility supports liquid cooling, perform leak, pressure, and isolation tests before production nodes are energized.
During deployment, maintain a change log that records who changed what, when, and why. In dense AI environments, a small change in airflow direction, switch placement, or power loading can have outsized consequences. Good documentation keeps operations reproducible as the environment scales. This discipline is especially important for teams that operate across multiple sites and need consistent deployment patterns.
Post-launch checklist
After launch, validate actual power draw against planned draw, measure thermal headroom at peak load, and review network performance under training and inference traffic. Check whether redundancy behaved as expected during maintenance or failover tests. Then compare actual operating costs against the original TCO model, including power, cross-connects, cooling surcharges, and any premium services. AI colocation is a long-term operating choice, so the final test is whether the site remains efficient after the first production cycle.
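A simple line-item comparison, as sketched below with placeholder figures, makes cost drift visible early.

```python
# Post-launch cost check: compare actual monthly spend against the original
# TCO model line by line so drift is visible early. Figures are placeholders.

PLANNED = {"power": 62_000, "cross_connects": 3_500,
           "cooling_surcharge": 9_000, "remote_hands": 1_200}
ACTUAL = {"power": 71_500, "cross_connects": 3_500,
          "cooling_surcharge": 14_200, "remote_hands": 2_900}

for item, planned in PLANNED.items():
    pct = 100.0 * (ACTUAL[item] - planned) / planned
    flag = "  <- investigate" if abs(pct) > 10 else ""
    print(f"{item:18s} planned {planned:>7,}  actual {ACTUAL[item]:>7,} ({pct:+.0f}%){flag}")
```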
Pro Tip: If a facility only looks good on paper, it is usually because the hard parts have been deferred. Demand live power numbers, a real cooling roadmap, and carrier details before you commit to moving a single GPU.
9) Decision framework: how to choose the right site fast
Use a three-part filter
The fastest way to eliminate poor options is to use a three-part filter: power readiness, thermal readiness, and network readiness. If a site fails any one of those, move on unless the provider can show a concrete remediation plan and a realistic delivery date. This is much more efficient than debating small pricing differences between sites that are structurally unsuitable for your workload. For AI programs, the wrong site is never cheap because the migration cost comes later.
For teams comparing multiple metros, weight the filter based on business need. A training-heavy workload may prioritize power first, while a latency-sensitive inference service may put connectivity first. If your organization relies on hybrid cloud, pair the filter with private interconnect availability and cloud proximity. This keeps the location decision aligned to the workload instead of to generic real estate logic.
Red flags that should stop procurement
Watch out for vague answers about “future AI capacity,” undocumented cooling capabilities, overloaded sales language without engineering proof, and carrier-neutral claims that do not include a carrier list. Another red flag is a provider that cannot clearly separate facility load from IT load. If the numbers are fuzzy at the start, they usually get worse during implementation. Your procurement team should treat ambiguity as a cost signal, not a negotiation opportunity.
What success looks like
Success means your team can deploy on schedule, scale without a redesign, and operate within thermal and power headroom. It means the network paths are stable, the site can support the next hardware generation, and the contract reflects operational reality instead of marketing language. Most importantly, it means the infrastructure is enabling AI delivery rather than slowing it down. That is the real objective of any serious colocation checklist for modern AI systems.
10) Conclusion: turn AI colocation into a repeatable operating model
High-density AI colocation is not about finding the cheapest rack or the flashiest facility. It is about matching workload requirements to a site that can deliver power now, remove heat reliably, and connect you to the places your models need to reach. Once you frame the problem that way, procurement becomes more disciplined, deployment becomes more predictable, and operations become easier to scale. The teams that win are the ones that treat infrastructure as a strategic system, not a vendor checkbox.
If you want to go deeper on the broader infrastructure shift behind this trend, revisit our guide to redefining AI infrastructure. If your team is also building the governance side of the deployment, the article on vendor checklists for AI tools pairs well with the commercial review process. The right colocation decision should make your AI program faster, safer, and easier to operate for years—not just for the first batch of servers.
FAQ
What rack density should I plan for in AI colocation?
Start with your actual hardware roadmap, not generic enterprise assumptions. Many AI deployments quickly move beyond traditional densities, so you should model current and next-generation racks separately. If your next accelerator platform requires liquid cooling or dramatically more power, choose a site that can support that transition without relocation.
Is liquid cooling mandatory for AI infrastructure?
Not always, but it is becoming much more common as densities rise. Air cooling may still work for smaller pilots or transitional deployments, but direct-to-chip liquid cooling and rear-door heat exchangers are often necessary for sustained high-density AI. The right answer depends on your rack kW target and the provider’s mechanical design.
Why is carrier-neutral colocation important?
Carrier neutrality gives you flexibility, pricing leverage, and better routing choices. It allows you to select carriers, cloud on-ramps, and interconnects based on performance and resilience rather than vendor lock-in. That matters more as AI clusters depend on hybrid cloud access and low-latency connectivity.
How do I compare colocation sites objectively?
Use a weighted scorecard that includes live power availability, cooling compatibility, rack density support, connectivity, expansion rights, and commercial flexibility. Ask for written answers and compare them against your workload requirements. If a site cannot prove it can support your density and cooling targets, it should not rank highly.
What is the most common mistake teams make?
The biggest mistake is choosing a site based on current needs only. AI hardware evolves quickly, and the facility that works for a pilot can become a bottleneck for production. Teams should select a location that fits the next two hardware cycles, not just the first deployment.
How can DevOps teams help with colocation planning?
DevOps teams can translate application requirements into concrete infrastructure constraints, such as latency budgets, job resilience, and monitoring needs. They can also define runbooks, incident response workflows, and acceptance tests for the physical environment. This ensures the facility supports operations, not just equipment placement.