April 29, 2024 — Amid the challenges the U.S. Department of Energy’s Oak Ridge Leadership Computing Facility faced in assembling and launching Frontier, the world’s first exascale-class (more than 1 quintillion calculations per second) supercomputer, one critical component worked without a hitch.
Critical to the functioning of Frontier is the ability to store the vast amounts of data it generates in its file system, Orion. But even more important to the computational scientists running simulations on Frontier is the ability to quickly write to and read from Orion, and to effectively analyze all that data. This is where the Adaptable IO System (ADIOS) comes in.
ADIOS is an input/output (I/O) framework that gives scientists a simple, flexible way to describe the data they need to write, read, or process while their simulations are running. This makes it far easier for researchers to analyze the vast amounts of data Frontier generates. Since it was first developed at DOE’s Oak Ridge National Laboratory in 2008, the open-source ADIOS framework has become an essential tool for high-performance computing (HPC) simulations around the world, continually evolving in both capabilities and user base.
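In practice, that description takes only a few calls. Below is a minimal sketch of writing one array per simulation step with the ADIOS 2 C++ API; the variable name, sizes, and file name are illustrative, and an MPI-enabled ADIOS 2 build is assumed.

```cpp
#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const std::size_t localN = 1000;                 // elements owned by this rank
    std::vector<double> temperature(localN, 300.0);  // stand-in for simulation data

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("SimulationOutput");

    // Self-describing variable: global shape, this rank's offset, local count.
    auto varT = io.DefineVariable<double>(
        "temperature", {size * localN}, {rank * localN}, {localN});

    adios2::Engine writer = io.Open("sim.bp", adios2::Mode::Write);
    for (int step = 0; step < 5; ++step)
    {
        // ... advance the simulation and update temperature here ...
        writer.BeginStep();
        writer.Put(varT, temperature.data());
        writer.EndStep();  // data for this step is flushed and aggregated here
    }
    writer.Close();

    MPI_Finalize();
    return 0;
}
```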
Pictured here with Frontier’s Orion file system are the designers of the Adaptable IO System: Scott Klasky, left, who heads the ADIOS project and ORNL’s Workflow Systems Group, and Norbert Podhorszki, right, an ORNL computer scientist who oversees the ongoing development of ADIOS. Photo by Genevieve Martin/ORNL.
“ADIOS has always delivered,” says Bronson Messer, OLCF’s director of science. “With Frontier, we switched to a new parallel file system, and ADIOS made it all work. ADIOS runs well on high-performance computers, so large infrastructure changes like a new file system don’t impact our leadership projects.”
At Leadership Computing Facilities such as OLCF, a DOE Office of Science User Facility at ORNL, ADIOS helps computational scientists analyze the massive amounts of data projects generate even before the data is fully written, enabling scientists to gain early understanding of results as they run simulations.
“Simulations generate all this data, but that doesn’t mean we have to describe it all,” says Scott Klasky, who leads ADIOS development and the Workflow Systems Group in ORNL’s Computer Science and Mathematics Division.
“If you want to get something, you can get it. You can put it in storage, or you can get it from memory and process it. That’s the beauty of ADIOS. ADIOS makes those modalities possible. It was the first technology to allow scientists to process data in motion and data at rest in a unified way, and it remains the fastest technology today.”
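That unified treatment is visible in the API itself: the same read loop can consume data at rest (BP files on disk) or data in motion (a staging stream from a running simulation), with only the engine name changing. A minimal sketch, assuming a serial ADIOS 2 build and the illustrative "temperature" variable from the sketch above:

```cpp
#include <adios2.h>
#include <iostream>
#include <vector>

int main()
{
    adios2::ADIOS adios;
    adios2::IO io = adios.DeclareIO("AnalysisInput");
    io.SetEngine("BP5");  // swap to "SST" to read a live stream from a running simulation

    adios2::Engine reader = io.Open("sim.bp", adios2::Mode::Read);
    while (reader.BeginStep() == adios2::StepStatus::OK)
    {
        auto varT = io.InquireVariable<double>("temperature");
        std::vector<double> temperature;
        reader.Get(varT, temperature);
        reader.EndStep();  // Get calls are completed here
        std::cout << "read one step: " << temperature.size() << " values\n";
    }
    reader.Close();
    return 0;
}
```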
The evolution of ADIOS
Klasky began thinking about the need for middleware like ADIOS while working on codes to simulate black holes as a PhD student in physics at the University of Texas at Austin, and later, as a research and development scientist at the Princeton Plasma Physics Laboratory, where he used gyrokinetic toroidal codes to understand turbulent transport in nuclear fusion reactors.
“At the time, we were trying to do something simple: write out a terabyte of data in a day using state-of-the-art, self-describing, parallel I/O systems,” Klasky says. “Some people might think a terabyte doesn’t sound like much now, but in 1999 we were running 1,000 processors on an IBM RS/6000 SP supercomputer at the National Energy Research Scientific Computing Center. To do that with state-of-the-art technology, 50 percent of the compute time was spent on I/O.”
After Klasky arrived at ORNL in 2005, he began seriously pursuing what would become ADIOS. He assembled a team to develop the framework, bringing in researchers from institutions including Georgia Tech and Rutgers University. The project was heavily driven by computer scientist Norbert Podhorszki, who was hired by ORNL in 2008 and began developing ADIOS 1.0, which aimed to achieve a tenfold increase in I/O speed for the largest applications running on the OLCF’s Jaguar supercomputer.
“Long before ADIOS, there was always a need for self-describing data to make computational scientists’ jobs easier,” says Podhorszki, “but design bottlenecks meant performance was not an option: overall throughput dropped sharply as applications scaled to thousands of processes.

“Everyone in HPC, especially on the big computers here at Oak Ridge, was forced to work in bytes, trying to build their own data solutions from the ground up, generating output and reading data back at the byte level. It was a huge pain. So we thought, ‘Oh, maybe we can do this better.’”
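ADIOS’s answer was to make output self-describing: each variable carries its name, type, and global shape, and attributes travel with the data, so readers never have to decode raw bytes on their own. A brief sketch using the ADIOS 2 C++ API; the attribute names and values here are illustrative:

```cpp
#include <adios2.h>
#include <string>

int main()
{
    adios2::ADIOS adios;
    adios2::IO io = adios.DeclareIO("SelfDescribing");

    // The variable carries its own name, type, and global shape ...
    const std::size_t nx = 64, ny = 64;
    io.DefineVariable<double>("temperature", {nx, ny}, {0, 0}, {nx, ny});

    // ... and attributes travel with it, so a reader needs no side channel
    // to interpret the bytes. Attribute names and values are illustrative.
    io.DefineAttribute<std::string>("units", "kelvin", "temperature");
    io.DefineAttribute<std::string>("description",
                                    "cell-centered temperature field",
                                    "temperature");
    return 0;
}
```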
But by 2015, after 14 major releases, ADIOS’s patchwork of code had become difficult to manage and needed a revamp. DOE’s Exascale Computing Project (ECP), launched in 2016 to prepare software applications and technologies for coming exascale systems like Frontier, provided the funding to hire software engineers to develop the new ADIOS.
“ADIOS 2.0 was born in 2016, rewritten from scratch, line by line, from ADIOS 1.0,” says Podhorszki. “The change in programming language from C to C++11 changed everything. We had two main goals: first, to redesign and reimplement the product to support the file systems of the upcoming exascale computers, and second, after years of research, to bring staging to production quality for daily use in applications.”
ADIOS 2.9, released at the end of the ECP, enables Frontier’s flagship applications to generate and consume multiple terabytes of data per second through the Orion file system.
ADIOS for Science
ADIOS continues to have a lasting impact on computational science through widespread adoption by teams developing or using important simulation codes such as the ECP-enabled Exascale Atomistic Capability for Accuracy, Length, and Time (EXAALT), a molecular dynamics simulation software stack for identifying optimal materials for building nuclear fission and fusion reactors.
Codes using ADIOS include recent winners of the ACM Gordon Bell Prize, one of the computing world’s most prestigious awards. In 2023, the Energy Exascale Earth System Model team, a 19-member team from across the national laboratory complex, won the inaugural ACM Gordon Bell Prize for climate modeling for its Simple Cloud-Resolving E3SM Atmosphere Model (SCREAM) project. A year earlier, a 16-member team from Lawrence Berkeley National Laboratory, Lawrence Livermore National Laboratory, and the French Alternative Energies and Atomic Energy Commission won the 2022 ACM Gordon Bell Prize for their kinetic plasma simulation code WarpX. Both winning teams ran their projects on Frontier.
“What is the purpose of a machine like Frontier? Is it to compute faster?” Klasky says. “That’s probably the wrong answer. Anyone can do computations, but what really matters is how the data from those computations are used to advance scientific discovery.”
“If we could generate data very efficiently or process it on the fly without significantly slowing down the computational speed, we could add much more value to these large machines. That’s why we work closely with many application teams around the world. This is an essential element of our success: close partnerships.”
Many of these partnerships stem from DOE’s Scientific Discovery Through Advanced Computing (SciDAC) program, which was created to bring together the nation’s top researchers to develop new computational methods for tackling some of the toughest scientific problems. Run under DOE’s Advanced Scientific Computing Research (ASCR) program, SciDAC partners with other DOE offices and laboratories to fund the development of cutting-edge scientific software.
“A lot of the applications we work with, beyond the ones here at ORNL, come from SciDAC,” Klasky says. “We collaborate through ASCR, and some of the fundamental research we’re doing, like data reduction and querying, comes from ASCR research proposals. And when we find something that works in a particular app, we say, ‘Can we bring that into ADIOS so we can use it in even more apps?’”
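Data reduction is one example of research that has folded back into the framework: ADIOS 2 can attach compression operators to individual variables so data is reduced on its way to storage. The sketch below attaches a lossy ZFP operator at write time; it assumes an ADIOS 2 build with ZFP support enabled, and the tolerance value is illustrative.

```cpp
#include <adios2.h>
#include <vector>

int main()
{
    adios2::ADIOS adios;
    adios2::IO io = adios.DeclareIO("ReducedOutput");

    const std::size_t n = 1 << 20;
    std::vector<double> field(n, 1.0);  // stand-in for simulation output

    auto var = io.DefineVariable<double>("field", {n}, {0}, {n});

    // Attach a lossy compression operator; the data is reduced on the way
    // to storage. Requires an ADIOS 2 build with ZFP; the accuracy value
    // below is illustrative, not a recommendation.
    var.AddOperation("zfp", {{"accuracy", "0.001"}});

    adios2::Engine writer = io.Open("reduced.bp", adios2::Mode::Write);
    writer.BeginStep();
    writer.Put(var, field.data());
    writer.EndStep();
    writer.Close();
    return 0;
}
```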
ADIOS for Industry
ADIOS’s capabilities, which let researchers quickly write and read self-describing data at scale, are also highly attractive to computational scientists running simulations at industrial companies. As a result, the ADIOS team has frequently helped companies speed up I/O in their codes, such as German software company NUMECA’s FINE/Turbo computational fluid dynamics suite for turbomachinery and property insurer FM Global’s OpenFOAM-based warehouse fire modeling.
“Working with industry on applications is one of the fun parts of the job,” says Podhorszki. “We’re always focused on research, and research has different priorities, so we’re forced to bring together all the things we’ve developed but haven’t had time to integrate into the whole. Making it run smoothly is the number one priority there, so it’s really beneficial to have had these contracts over the years and to raise the quality of the software overall.”
General Electric Aviation’s current research at the OLCF investigates turbulence and turbine design using homegrown finite element simulations. Podhorszki’s work with ADIOS on this project produced a dramatic speedup: GE wanted to write 100 terabytes of data per day, which would have been prohibitively expensive without much faster I/O. Podhorszki aimed for a 100x speedup and achieved 500x.
“GE is now able to write a lot more data than we ever expected,” says Klasky. “They don’t have to change their applications, they can just use ADIOS to go further. I think that’s the power of ADIOS.”
What’s next?
With more than 1 exaflop of computing power, Frontier can generate data at a substantial rate: approximately 10 petabytes per day, or a sustained rate of roughly 115 gigabytes per second. But with that data volume comes new data management challenges.
“The problem changes in the future because Frontier might generate 10 petabytes of data every day, but we can’t process that amount of data efficiently,” Podhorszki says. “Then we need to focus on the next problem: We have too much data. What do we do with it? How do we support processing and scientific discovery?”
Klasky envisions a solution comparable to how we view the thousands of photos we take on our smartphones: most of the photos aren’t actually stored on the phone but on a cloud service, yet the photo app shows us a representation of each one so we can see what it looks like and select it for download or sharing.
“Can you provide that experience with massive amounts of data? How do you handle the largest data sets today on, say, a laptop or a cluster?” Klasky says. “I don’t think everybody needs to own Frontier to be able to see what’s in the data. That’s a big driver of where we’re heading: how do we get there?”
UT-Battelle manages ORNL for the DOE’s Office of Science, the largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit energy.gov/science.
Source: Coury Turczyn, ORNL