To cool datacenter servers, Microsoft turns to boiling liquid

Emails and other communications sent between Microsoft employees are literally making liquid boil inside a steel holding tank packed with computer servers at this datacenter on the eastern bank of the Columbia River.

Unlike water, the fluid inside the couch-shaped tank is harmless to electronic equipment and engineered to boil at 122 degrees Fahrenheit, 90 degrees lower than the boiling point of water.
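
The temperature figures above can be checked with a few lines of arithmetic. The sketch below assumes water boils at 212 °F (sea-level pressure); the function and constant names are illustrative only and do not come from Microsoft or 3M.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


WATER_BOILING_F = 212.0  # assumption: boiling point of water at sea-level pressure
FLUID_BOILING_F = 122.0  # boiling point of the engineered fluid, as stated in the article

# Difference between the two boiling points, in degrees Fahrenheit.
gap_f = WATER_BOILING_F - FLUID_BOILING_F

# The fluid's boiling point expressed in degrees Celsius.
fluid_boiling_c = fahrenheit_to_celsius(FLUID_BOILING_F)

print(f"Gap below water's boiling point: {gap_f:.0f} degrees F")
print(f"Fluid boiling point: {fluid_boiling_c:.0f} degrees C")
```

In metric terms, the fluid boils at about 50 degrees Celsius.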

 

The boiling effect, which is generated by the work the servers are doing, carries heat away from laboring computer processors. The low-temperature boil enables the servers to operate continuously at full power without risk of failure due to overheating.

Inside the tank, the vapor rising from the boiling fluid contacts a cooled condenser in the tank lid, which causes the vapor to change to liquid and rain back onto the immersed servers, creating a closed-loop cooling system.

“We are the first cloud provider that is running two-phase immersion cooling in a production environment,” said Husam Alissa, a principal hardware engineer on Microsoft’s team for datacenter advanced development in Redmond, Washington.

Figure 1. Ioannis Manousakis, a principal software engineer with Azure (left), and Husam Alissa, a principal hardware engineer on Microsoft’s team for datacenter advanced development (right), inspect the inside of a two-phase immersion cooling tank at a Microsoft datacenter. Photo by Gene Twedt for Microsoft.

Moore’s Law for the datacenter

The production environment deployment of two-phase immersion cooling is the next step in Microsoft’s long-term plan to keep up with demand for faster, more powerful datacenter computers at a time when reliable advances in air-cooled computer chip technology have slowed.

For decades, chip advances stemmed from the ability to pack more transistors onto the same size chip, roughly doubling the speed of computer processors every two years without increasing their electric power demand.
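
To make the doubling rate concrete, the following sketch computes the idealized speedup implied by a fixed doubling period. It is a simplified model for illustration, assuming a clean doubling every two years; real processor generations did not follow the curve this exactly, and the function name is hypothetical.

```python
def speedup_after(years: float, doubling_period_years: float = 2.0) -> float:
    """Idealized speedup factor after `years` if speed doubles every period."""
    return 2.0 ** (years / doubling_period_years)


# Under this idealized model, ten years of doubling every two years
# compounds to five doublings.
decade_speedup = speedup_after(10)
print(f"Idealized speedup after 10 years: {decade_speedup:.0f}x")
```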

 

This doubling phenomenon is called Moore’s Law after Intel co-founder Gordon Moore, who observed the trend in 1965 and predicted it would continue for at least a decade. It held through the 2010s and has now begun to slow.

That’s because transistor widths have shrunk to the atomic scale and are reaching a physical limit. Meanwhile, the demand for faster computer processors for high-performance applications such as artificial intelligence has accelerated, Alissa noted.

To meet the need for performance, the computing industry has turned to chip architectures that can handle more electric power. Central processing units, or CPUs, have increased from 150 watts to more than 300 watts per chip, for example. Graphics processing units, or GPUs, have increased to more than 700 watts per chip.

The more electric power pumped through these processors, the hotter the chips get. The increased heat has ramped up cooling requirements to prevent the chips from malfunctioning.

“Air cooling is not enough,” said Christian Belady, distinguished engineer and vice president of Microsoft’s datacenter advanced development group in Redmond. “That’s what’s driving us to immersion cooling, where we can directly boil off the surfaces of the chip.”

Heat transfer in liquids, he noted, is orders of magnitude more efficient than in air.

What’s more, he added, the switch to liquid cooling brings a Moore’s Law-like mindset to the whole of the datacenter.

“Liquid cooling enables us to go denser, and thus continue the Moore’s Law trend at the datacenter level,” he said.

Figure 2. Christian Belady, distinguished engineer and vice president of Microsoft’s datacenter advanced development group, stands next to a two-phase immersion cooling tank at a Microsoft datacenter. Photo by Gene Twedt for Microsoft.

Lesson learned from cryptocurrency miners

Liquid cooling is a proven technology, Belady noted. Most cars on the road today rely on it to prevent engines from overheating. Several technology companies, including Microsoft, are experimenting with cold plate technology, in which liquid is piped through metal plates, to chill servers.

Participants in the cryptocurrency industry pioneered liquid immersion cooling for computing equipment, using it to cool the chips that log digital currency transactions.

Microsoft investigated liquid immersion as a cooling solution for high-performance computing applications such as AI. Among other things, the investigation revealed that two-phase immersion cooling reduced power consumption for any given server by 5% to 15%.
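
As a rough illustration of what a 5% to 15% reduction means per server, the sketch below applies both ends of the range to a baseline power draw. The 500-watt baseline is a hypothetical value chosen for the example; the article gives no per-server wattage.

```python
def power_after_savings(baseline_watts: float, reduction_fraction: float) -> float:
    """Return the power draw after applying a fractional reduction."""
    return baseline_watts * (1.0 - reduction_fraction)


BASELINE_WATTS = 500.0  # hypothetical server power draw, not a figure from the article

# Reduction range reported in the article: 5% to 15%.
low_savings_draw = power_after_savings(BASELINE_WATTS, 0.05)
high_savings_draw = power_after_savings(BASELINE_WATTS, 0.15)

print(f"With a 5% reduction:  {low_savings_draw:.0f} W")
print(f"With a 15% reduction: {high_savings_draw:.0f} W")
```

For this assumed baseline, the server would draw between 425 and 475 watts instead of 500.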

 

The findings motivated the Microsoft team to work with Wiwynn, a datacenter IT system manufacturer and designer, to develop a two-phase immersion cooling solution. The first solution is now running at Microsoft’s datacenter in Quincy.

That couch-shaped tank is filled with an engineered fluid from 3M. 3M’s liquid cooling fluids have dielectric properties that make them effective insulators, allowing the servers to operate normally while fully immersed in the fluid.

This shift to two-phase liquid immersion cooling enables increased flexibility for the efficient management of cloud resources, according to Marcus Fontoura, a technical fellow and corporate vice president at Microsoft who is the chief architect of Azure compute.

For example, software that manages cloud resources can allocate sudden spikes in datacenter compute demand to the servers in the liquid-cooled tanks. That’s because these servers can run at elevated power, a process called overclocking, without risk of overheating.
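
The allocation idea described above can be sketched as a simple placement policy: send burst demand to immersion-cooled servers first, since they can be overclocked, and spill over to air-cooled servers only if needed. This is an illustrative toy model, not Microsoft's scheduler; the `Server` class, the `overclock_factor` headroom multiplier, and all capacity numbers are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    immersion_cooled: bool
    base_capacity: float            # work units at nominal clock (hypothetical unit)
    load: float = 0.0               # work units currently in use
    overclock_factor: float = 1.0   # hypothetical capacity multiplier when overclocked

    @property
    def headroom(self) -> float:
        """Spare capacity; only immersion-cooled servers may use overclocking."""
        factor = self.overclock_factor if self.immersion_cooled else 1.0
        return max(0.0, self.base_capacity * factor - self.load)


def allocate_burst(servers: list[Server], demand: float) -> tuple[dict[str, float], float]:
    """Place burst demand on immersion-cooled servers first, then on the rest.

    Returns the per-server placements and any demand left unplaced.
    """
    placements: dict[str, float] = {}
    remaining = demand
    # Immersion-cooled servers sort first; within each group, most headroom first.
    ordered = sorted(servers, key=lambda s: (not s.immersion_cooled, -s.headroom))
    for server in ordered:
        if remaining <= 0:
            break
        share = min(server.headroom, remaining)
        if share > 0:
            server.load += share
            placements[server.name] = share
            remaining -= share
    return placements, remaining


fleet = [
    Server("air-1", immersion_cooled=False, base_capacity=100, load=80),
    Server("tank-1", immersion_cooled=True, base_capacity=100, load=80, overclock_factor=1.25),
    Server("tank-2", immersion_cooled=True, base_capacity=100, load=90, overclock_factor=1.25),
]

placements, unplaced = allocate_burst(fleet, demand=60)
print(placements, unplaced)
```

In this example the whole spike fits on the two tank servers, so the air-cooled server is left untouched.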

 

“For instance, we know that with Teams when you get to 1 o’clock or 2 o’clock, there is a huge spike because people are joining meetings at the same time,” Fontoura said. “Immersion cooling gives us more flexibility to deal with these burst-y workloads.”

Figure 3. Boiling liquid carries away heat generated by computer servers at a Microsoft datacenter. Microsoft is the first cloud provider to run two-phase immersion cooling in a production environment. Photo by Gene Twedt for Microsoft.

Sustainable datacenters

Adding the two-phase immersion-cooled servers to the mix of available compute resources will also allow machine learning software to manage these resources more efficiently across the datacenter, from power and cooling to maintenance technicians, Fontoura added.

“We will have not only a huge impact on efficiency, but also a huge impact on sustainability because you make sure that there is not wastage, that every piece of IT equipment that we deploy will be well utilized,” he said.

Liquid cooling is also a waterless technology, which will help Microsoft meet its commitment to replenish more water than it consumes by the end of this decade.

The cooling coils that run through the tank and enable the vapor to condense are connected to a separate closed-loop system that uses fluid to transfer heat from the tank to a dry cooler outside the tank’s container. Because the fluid in these coils is always warmer than the ambient air, there’s no need to spray water to condition the air for evaporative cooling, Alissa explained.

Microsoft, together with infrastructure industry partners, is also investigating how to run the tanks in ways that mitigate fluid loss and will have little to no impact on the environment.

“If done right, two-phase immersion cooling will attain all our cost, reliability and performance requirements simultaneously with essentially a fraction of the energy spend compared to air cooling,” said Ioannis Manousakis, a principal software engineer with Azure.

Figure 4. A Microsoft team is exploring two-phase immersion cooling technology. Pictured from left to right: Dave Starkenburg, datacenter operations management; Christian Belady, distinguished engineer and vice president of Microsoft’s datacenter advanced development group; Ioannis Manousakis, principal software engineer with Azure; and Husam Alissa, principal hardware engineer on Microsoft’s team for datacenter advanced development. Photo by Gene Twedt for Microsoft.

‘We brought the sea to the servers’

Microsoft’s investigation into two-phase immersion cooling is part of the company’s multi-pronged strategy to make datacenters more sustainable and efficient to build, operate and maintain.

For example, the datacenter advanced development team is also exploring the potential to use hydrogen fuel cells instead of diesel generators for backup power generation at datacenters.

The liquid cooling project is similar to Microsoft’s Project Natick, which is exploring the potential of underwater datacenters that are quick to deploy and can operate for years on the seabed sealed inside submarine-like tubes without any onsite maintenance by people.

Instead of an engineered fluid, the underwater datacenter is filled with dry nitrogen air. The servers are cooled with fans and a heat exchange plumbing system that pumps piped seawater through the sealed tube.

A key finding from Project Natick is that the servers on the seafloor experienced one-eighth the failure rate of replica servers in a land datacenter. Preliminary analysis indicates that the lack of humidity and corrosive effects of oxygen were primarily responsible for the superior performance of the servers underwater.

Alissa anticipates the servers inside the liquid immersion tank will experience similar superior performance. “We brought the sea to the servers rather than put the datacenter under the sea,” he said.

Figure 5. Ioannis Manousakis, a principal software engineer with Azure, removes a server blade from a two-phase immersion cooling tank at a Microsoft datacenter. Photo by Gene Twedt for Microsoft.

The future

If the servers in the immersion tank experience reduced failure rates as anticipated, Microsoft could move to a model where components are not immediately replaced when they fail. This would limit vapor loss as well as allow tank deployment in remote, hard-to-service locations.

What’s more, the ability to densely pack servers in the tank enables a re-envisioned server architecture that’s optimized for low-latency, high-performance applications as well as low-maintenance operation, Belady noted.

Such a tank, for example, could be deployed under a 5G cellular communications tower in the middle of a city for applications such as self-driving cars.

For now, Microsoft has one tank running workloads in a hyperscale datacenter. For the next several months, the Microsoft team will perform a series of tests to prove the viability of the tank and the technology.

“This first step is about making people feel comfortable with the concept and showing we can run production workloads,” Belady said.