Problem

Molecular dynamics (MD) simulations are notoriously computationally intensive, especially when run with explicit solvent. GPUs have become essential for accelerating MD engines like OpenMM, offering order-of-magnitude speedups over CPUs [1].

SimAtomic collaborated with Shadeform, a marketplace providing access to GPU instances across 20+ cloud providers, to benchmark a range of GPU instance types for molecular dynamics simulations, including instances from Scaleway and Hyperstack. We also ran benchmarks on Nebius and AWS.

Our evaluation focused on two key metrics: simulation speed (ns/day) and cost efficiency (cost per 100 ns simulated). One frequently overlooked bottleneck is disk I/O: writing outputs (trajectories, checkpoints) too frequently can slow GPU performance by up to 4×, primarily due to the overhead of transferring data from GPU to CPU memory. We tested this explicitly and optimized the output interval to mitigate it.
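
For clarity, the cost metric follows directly from throughput and hourly price: simulating 100 ns takes (100 / ns-per-day) × 24 hours, and multiplying by the hourly rate gives the cost. The snippet below sketches this calculation with placeholder numbers, not measured values.

# Cost to simulate 100 ns, given throughput (ns/day) and an on-demand hourly price.
# The numbers in the example call are placeholders, not benchmark results.
def cost_per_100ns(price_per_hour_usd: float, ns_per_day: float) -> float:
    hours_needed = (100.0 / ns_per_day) * 24.0  # days needed, converted to hours
    return price_per_hour_usd * hours_needed

print(cost_per_100ns(price_per_hour_usd=1.50, ns_per_day=500.0))  # -> 7.2 USD per 100 ns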

To ensure a consistent setup and reproducibility, we used UnoMD [2], an open-source Python package built on OpenMM that simplifies running MD simulations. Here we present a high-level comparison of NVIDIA's A100, H100, H200, L40S, V100, and T4 GPUs for running MD, evaluating both raw performance and cost-effectiveness in a cloud context.

Benchmark Setup

We simulated T4 lysozyme (PDB ID: 4W52), a medium-sized biomolecular system in explicit water solvent (~43,861 atoms total). Simulations were run using UnoMD/OpenMM on the CUDA platform (CUDA 12.2) with a 2 fs integration timestep, PME electrostatics, a 100 ps simulation time, and mixed precision.
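
For readers who prefer to see the settings spelled out directly in OpenMM rather than through UnoMD, the sketch below reproduces the same configuration. The input file name, force field, and water model are illustrative assumptions; only the timestep, PME electrostatics, mixed precision, and CUDA platform match the benchmark.

# Minimal OpenMM sketch of the benchmark settings (2 fs timestep, PME, mixed
# precision, CUDA platform). File name and force field below are assumptions.
from openmm import LangevinMiddleIntegrator, Platform
from openmm.app import PDBFile, ForceField, PME, HBonds, Simulation
from openmm.unit import kelvin, picosecond, femtoseconds, nanometer

pdb = PDBFile("4W52_solvated.pdb")                       # assumed pre-solvated structure
ff = ForceField("amber14-all.xml", "amber14/tip3p.xml")  # assumed force field / water model
system = ff.createSystem(pdb.topology, nonbondedMethod=PME,
                         nonbondedCutoff=1.0 * nanometer, constraints=HBonds)
integrator = LangevinMiddleIntegrator(300 * kelvin, 1 / picosecond, 2 * femtoseconds)
platform = Platform.getPlatformByName("CUDA")
sim = Simulation(pdb.topology, system, integrator, platform,
                 platformProperties={"Precision": "mixed"})
sim.context.setPositions(pdb.positions)
sim.step(50_000)  # 100 ps at 2 fs per step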

To avoid I/O-related slowdowns, trajectory frames were saved every 1,000 or 10,000 steps (2 ps or 20 ps), based on tests confirming that these intervals maintained high GPU utilization [3].

We ran UnoMD's quickrun function with save intervals of 10, 100, 1,000, and 10,000 steps:

# Example: optimized MD simulation setup
from unomd import quickrun  # import path assumed; see the UnoMD repository [2]

quickrun(protein_file="unomd/example/4W52.pdb",
         md_save_interval=1000,
         nsteps=50000,
         platform_name="CUDA")
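
To cover all four intervals, the same call can simply be repeated; the loop below is an illustrative sketch rather than the exact benchmark driver.

# Sweep the tested save intervals with the quickrun call shown above
# (illustrative driver, not the exact benchmark script).
from unomd import quickrun  # import path assumed, as above

for interval in (10, 100, 1000, 10000):
    quickrun(protein_file="unomd/example/4W52.pdb",
             md_save_interval=interval,
             nsteps=50000,
             platform_name="CUDA")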

Additional benchmark results showing performance metrics across different GPU configurations.

Effect of Output Frequency on GPU Utilization

As shown in Fig. 1, reducing the frequency of trajectory saving significantly improves GPU utilization and simulation throughput in OpenMM. This is due to the overhead of transferring data from GPU to CPU during each save, which interrupts computation and leaves the GPU idle. Saving less often reduces these interruptions, allowing the GPU to compute more efficiently.

This effect is especially pronounced in short runs (e.g., 100 ps), where saving events represent a larger fraction of the total runtime. In longer simulations, this overhead is amortized over more timesteps, making the impact less severe. Thus, minimizing save frequency is critical for maximizing performance, especially in short OpenMM simulations.
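
GPU utilization itself can be sampled while a run is in progress, for example by polling nvidia-smi from a background thread. The sketch below shows one such approach; it is illustrative and not necessarily the exact tooling behind Fig. 1.

# Sample GPU utilization with nvidia-smi while a simulation runs in the main
# thread. One possible measurement approach, shown for illustration.
import subprocess, threading, time

def poll_gpu_utilization(samples, stop_event, period_s=1.0):
    # Append one utilization reading (percent) per period until stopped.
    while not stop_event.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True)
        samples.append(int(out.stdout.split()[0]))
        time.sleep(period_s)

samples, stop = [], threading.Event()
worker = threading.Thread(target=poll_gpu_utilization, args=(samples, stop))
worker.start()
# ... run the simulation here (e.g., the quickrun call above) ...
stop.set()
worker.join()
print(f"mean GPU utilization: {sum(samples) / max(len(samples), 1):.1f}%")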


Fig. 1) Comparison of GPU utilization vs. saving interval (ps) for a 100 ps simulation of T4 lysozyme.

Insight

Our benchmarks show that not all GPUs are equally cost-effective, even if their peak performance is high. As illustrated in Fig. 2, both the H200 and L40S GPUs reach over 500 ns/day, while the T4 and V100 top out below 250 ns/day despite maintaining high GPU utilization (≥90%). However, Fig. 3 highlights a crucial point: raw speed alone does not equate to cost-efficiency.

These findings underscore that cost-effectiveness is shaped by both GPU architecture and cloud pricing, not just simulation speed. While the H100 and H200 excel in raw performance, they are optimized for machine learning and hybrid MD-AI workflows (e.g., using machine-learned force fields). For traditional MD simulations, the L40S remains the most cost-efficient choice, offering top-tier performance at a fraction of the cost. Meanwhile, emerging platforms like Hyperstack and Scaleway, which we accessed through Shadeform, show promising price-performance on the A100 and H100 as well.


Fig. 2) Comparison of GPU utilization across GPUs.


Fig. 3) Cost per 100 ns for simulating T4 lysozyme (~44K atoms) using OpenMM. AWS T4 (g4dn.2xlarge, 200 GB storage), AWS V100 (p3.2xlarge), H200 (Nebius, 200 GB storage), L40S (Nebius, 200 GB storage), H100 (Scaleway via Shadeform, 3.5 TB storage), A100 (Hyperstack via Shadeform, 800 GB storage), and H100 (Hyperstack via Shadeform, 800 GB storage). Costs assume a 24-hour runtime and are normalized against the T4. GPU utilization was measured at optimized saving intervals.

Disclaimer

All experiments were conducted independently using open-source tools, and the analysis and conclusions presented here reflect our own findings without external influence. GPU compute credits were provided by Shadeform and Nebius.

References

  1. OpenMM GPU benchmark – https://blog.salad.com/openmm-gpu-benchmark
  2. UnoMD GitHub – https://github.com/simatomic/unomd
Tags: GPU Benchmarking, Molecular Dynamics, Cloud Computing, OpenMM, Cost Analysis