New generation of GPGPU and related hardware: computing systems microarchitecture and performance from servers to supercomputers

Mikhail Borisovich Kuzminsky; Кузьминский Михаил Борисович

doi:10.25209/2079-3316-2024-15-2-139-473

Authors: Kuzminsky M.B.¹
Affiliations:
1. Zelinsky Institute of Organic Chemistry of RAS
Issue: Vol 15, No 2 (2024)
Pages: 139-473
Section: Articles
URL: https://bakhtiniada.ru/2079-3316/article/view/299202
DOI: https://doi.org/10.25209/2079-3316-2024-15-2-139-473
ID: 299202

Abstract

This review surveys the current state of GPGPUs, focusing on their use for traditional HPC tasks and, to a lesser extent, for AI. The Nvidia V100 and A100 serve as the baseline GPGPUs. The Nvidia H100, AMD MI100 and MI200, Intel Ponte Vecchio (Data Center GPU Max), and the BR100 from Biren Technology are treated as new-generation GPGPUs. The microarchitecture and hardware features of these GPGPUs that matter for HPC and AI tasks are analyzed and compared, together with the most important additional hardware for building GPGPU-based computer systems: interconnects and CPUs specialized for working with the new generation of GPGPUs (a specialization that may be possible only during the initial period of their use). Brief information is given about the servers that use them (including multi-GPU servers) and about new supercomputers built on these GPGPUs, for which data on the performance achieved with GPGPUs were obtained.

The SDKs of the GPGPU manufacturers and software from other firms (including mathematical libraries) are briefly reviewed. Examples are given that demonstrate features of widely used programming models that are important for achieving maximum performance but that also make program code non-portable to other GPGPU models.

Particular attention is paid to tensor cores and their analogues in modern GPGPUs from other companies, including calculations with reduced precision (relative to FP64, the standard format for HPC) and with mixed precision, which are relevant because they sharply increase the performance achieved on GPGPU tensor cores. Data on their "real-world" performance in HPC and AI benchmarks and applications are analyzed. The use of modern batched linear algebra libraries on GPGPUs, including in HPC applications, is also briefly discussed.

Keywords

GPGPU, V100, A100, H100, Grace, GH200 Grace Hopper, MI100, MI200, Ponte Vecchio, Data Center GPU Max, BR100, CUDA, HIP, DPC++, Fortran, performance, HPC, AI, deep learning

About the authors

Mikhail Borisovich Kuzminsky

Zelinsky Institute of Organic Chemistry of RAS

Email: kus@free.net
Senior Researcher, Laboratory of Computer Software for Chemical Research, Institute of Organic Chemistry, Russian Academy of Sciences; Candidate of Chemical Sciences. Scientific interests: high-performance computing, computer hardware, computational chemistry.

References

Top500 the list, 61st edition, 2023 URL https://top500.org/lists/green500/2023/06/.
Tschudi W., Xu T., Sartor D., Stein J.. High-performance data centers. A research roadmap, 2004, 53 pp.
Maltenberger T., Ilic I., Tolovski I, Rab T.. “Evaluating multi-GPU sorting with modern interconnects”, SIGMOD'22: Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA, June 12–17, 2022), ACM, New York, 2022, ISBN 978-1-4503-9249-5, pp. 1795–1809.
Top500 the list, 61st edition, 2023 URL https://www.top500.org/lists/top500/2023/06/highs/.
Кузьминский М. Б.. «Современные серверные ARM-процессоры для суперЭВM: A64FX и другие. Начальные данные тестов производительности», Программные системы: теория и приложения, 13:1(52) (2022), с. 63–129.
Gao J., Zheng F., Qi F, Ding Y, Li H., Lu H., He W., Wei H., Jin L., Liu X., Gong D., Wang F., Zheng Y., Sun H., Zhou Z., Liu Y., You H.. “Sunway supercomputer architecture towards exscale computing: analysis and practice”, Science China Information Sciences, 64:4 (2021), 141101, 21 pp.
Selig J.. The cerebras software development kit: A technical overview, Cerebras systems Inc., 2022, 8 pp.
Andromeda, a 13.5 Million Core AI Supercomputer, 2024, a section on the Cerebras company site URL https://www.cerebras.net/andromeda/.
Top500 the list, 61st edition, 2023 URL https://www.top500.org/statistics/list/.
Morgan T. P.. Chip roadmaps unfold, crisscrossing and interconnecting, at AMD, Stackhouse Publishing, 2022 URL https://www.nextplatform.com/2022/06/14/chip-roadmaps-unfold-crisscrossing-and-interconnecting-at-amd/.
Shah A.. Intel Reiterates Plans to Merge CPU, GPU High-performance Chip Roadmaps, HPCwire, 2022 URL https://www.hpcwire.com/2022/05/31/intel-reiterates-plans-to-merge-cpu-gpu-high-performance-chip-roadmaps/.
Morgan T. P.. The Increasingly Graphic Nature Of Intel Datacenter Compute, Stackhouse Publishing, 2022 URL https://www.nextplatform.com/2022/06/08/the-increasingly-graphic-nature-of-intel-datacenter-compute/.
Evans J.. “Nvidia Grace”, 2022 IEEE Hot Chips 34 Symposium (HCS) (Cupertino, CA, USA, 21–23 August 2022), IEEE, 2022, pp. 1–20.
Elster A. C., Haugdahl T. A.. “Nvidia Hopper GPU and Grace CPU Highlights”, Computing in Science and Engineering, 24:2 (2022), pp. 95–100.
Evans J.. Inside Grace, GPU Technology Conference (GTC), Nvidia On-Demand, Nvidia, 2022 URL https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41129/.
CUDA C++ Programming Guide, Nvidia, 2024, 544 pp.
Ampere Tuning Guide, Nvidia, 2024, 22 pp.
Zhang Z., Jiao S., Li J., Wu W., Wan L., Qin X., Hu W., Yang J.. “KSSOLV-GPU: An efficient GPU-enabled MATLAB toolbox for solving the Kohn-Sham equations within density functional theory in plane-wave basis set”, Chinese Journal of Chemical Physics, 34:5 (2021), pp. 552–564.
Giannozzi P., Baseggio O., Bonfà P., Brunato D., Car R., Carnimeo I., Cavazzoni C., de Gironcoli S., Delugas P., Ruffino . F., Ferretti A., Marzari N., Timrov I., Urru A., Baroni S.. “Quantum ESPRESSO toward the exascale”, The Journal of chemical physics, 152:15 (2020), 154105.
Хэ Личжун. Темпы локализации графических процессоров ускоряются, и новые команды продолжают появляться, Capital Securities, Пекин, 2022, 15 с. (Китайский).
Bispo J., Barbosa J., Silva P., Morales C., Myllykoski M., Ojeda-May P., Bialczak M., Uchroński M., Włodarczyk A., Wauligmann P., Krishnasamy E., Varrette S., Lührs S.. Best Practice Guide: Modern Accelerators, ed. Shoukourian H. , PRACE, 2021, 111 pp.
Finkelstein J., Smith J. S., Mniszewski S. M., Barros K., Negre C. F. A., Rubensson E. H., Niklasson A. M. N.. “Quantum-based molecular dynamics simulations using tensor cores”, Journal of Chemical Theory and Computation, 17:10 (2021), pp. 6180–6192.
Posey S., Luitjens J., Hennigh O., Oberlin S.. “GPU-based HPC and AI developments for CFD” (Maui, Hawaii, USA, July 11-15, 2022), 2022, ICCFD11-3803, 5 pp.
Schade R., Kenter T., Elgabarty H., Lass M., Schütt O., Lazzaro A., Pabst H., Mohr S., Hutter J., Kühne T. D., Plessl C.. “Towards electronic structure-based molecular dynamics simulations with hundreds of millions of atoms”, Parallel Computing, 111 (July 2022), 102920, 11 pp.
Terzo O., Martinoviv {c} J (eds.). HPC, Big Data, and AI Convergence Towards Exascale: Challenge and Vision, 1st ed., CRC Press, 2022, ISBN 9781003176664, 322 pp.
Nowicki M., Górski Ł., Ba{ł}a P.. “PCJ Java library as a solution to integrate HPC, Big Data and Artificial Intelligence workloads”, Journal of Big Data, 8:1 (2021), pp. 1–21, 62.
Yin F., Shi F.. “A comparative survey of Big Data computing and HPC: from a parallel programming model to a cluster architecture”, International Journal of Parallel Programming, 50:1 (2022), pp. 27–64.
Yin J., Wang F., Shankar M.. “Strategies for integrating deep learning surrogate models with HPC simulation applications”, 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (Lyon, France, 2022), IEEE, 2022, ISBN 978-1-6654-9747-3, pp. 01–10.
Sukumar S. R., Balma J. A., Rickett C. D., Maschhoff K. J., Landman J., Yates C. R., Chittiboyina A. G., Peterson Y. K., Vose A., Byler K., Baudry J., Khan I. A.. “The convergence of HPC, AI and Big Data in rapid-response to the COVID-19 pandemic”, Driving Scientific and Engineering Discoveries Through the Integration of Exeriment, Big Data, and Modeling and Simulation: 21st Smoky Mountains Computational Sciences and Engineering, SMC 2021, Virtual Event, October 18-20, 2021, Revised Selected Papers, Communications in Computer and Information Science, vol. 1512, 2022, ISBN 978-3-030-96497-9, pp. 157-172.
Ejarque J., Badia R. M., Albertin L., f, Aloisio G., Baglione E., Becerra Y., Boschert S., Berlin J. R., D'Anca A., Elia D., Exrtier F., Fiore S., Flich J., Folch A., Gibbons S. J., Koldunov N., Lordan F., Lorito S., Løvholt F., Mac'{i}as J., Volpe M.. “Enabling dynamic and intelligent workflows for HPC, data analytics, and AI convergence”, Future Generation Computer Systems, 134 (September 2022), pp. 414–429.
Ihde N., Marten P., Eleliemy A., Poerwawinata G., Silva P., Tolovski I., Ciorba F. M., Rabl T.. “A survey of Big Data, High Performance Computing, and Machine Learning benchmarks”, Technology Conference on Performance Evaluation and Benchmarking, Lecture Notes in Computer Science, vol. 13169, Springer, Cham, 2021, ISBN 978-3-030-94436-0, pp. 98–118.
High-Performance Deep Learning Project (HiDL), Ohio state university, NOWLAB: Network Based Computing Lab URL http://hidl.cse.ohio-state.edu.
High-Performance Big Data Project (HiBD), Ohio state university, NOWLAB: Network Based Computing Lab URL http://hidl.cse.ohio-state.edu.
Jeon W., Ko G., Lee J., Lee H., Ha D., Ro W. W.. “Deep learning with GPUs”, Advances in Computers, 122 (2021), pp. 167–215.
Hong M., Xu L.. “Biren BR100 GPGPU: Accelerating Datacenter Scale AI Computing”, 2022 IEEE Hot Chips 34 Symposium (HCS) (Cupertino, CA, USA), 2022, pp. 1–22.
Shilov A.. Russian Company Taps China's Zhaoxin x86 CPU to Replace AMD, Intel CPUs, Tom's Hardware, Future US, New York, 2022 URL https://www.tomshardware.com/news/russian-company-taps-chinas-zhaoxin-x86-cpu-to-replace-amd-intel-cpus.
Shang H., Li F., Zhang Y., Zhang L., Fu Y., Gao Y., Wu Y., Duan X., Lin R., Liu X., Liu Y., Chen D.. “Exreme-scale quantum Raman spectra simulations on the leadership HPC system in China”, SC'21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, New York, November 2021, 13 pp.
Schneider D.. “The Exascale Era is Upon Us: The Frontier supercomputer may be the first to reach 1,000,000,000,000,000,000 operations per second”, IEEE Spectrum, 59:1 (January 2022), pp. 34–35.
Dongarra J., Geist A.. Report on the Oak Ridge National Laboratory's Frontier System, Oak Ridge National Laboratory, 2022, Technical Report ICL-UT-22-05 URL https://icl.utk.edu/files/publications/2022/icl-utk-1570-2022.pdfAccessed 15.10.2023.
Frontier Spec Sheet, Oak Ridge National Laboratory, UT-Battelle, 2019, 4 pp.
GPU nodes — LUMI-G, LUMI (Large Unified Modern Infrastructure) consortium, Hardware documentation URL https://docs.lumi-supercomputer.eu/hardware/lumig/.
Markomanolis G. S., Alpay A., Young J., Klemm M., Malaya N., Esposito A., Heikonen J., Bastrakov S., Debus A., Kluge T., Steiniger K., Stephan J., Widera R., Bussmann M.. “Evaluating GPU programming models for the LUMI supercomputer”, Supercomputing Frontiers, Lecture Notes in Computer Science (Asian Conference on Supercomputing Frontiers), vol. 13214, Springer, Cham, 2022, ISBN 978-3-031-10419-0, pp. 79–101.
Aurora, Argonne Leadership Computing Facility, Argonne National Laboratory URL https://www.alcf.anl.gov/aurora.
Peckham O.. LRZ announces new phase of SuperMUC-NG Supercomputer with Intels Ponte Vecchio GPU, Tabor network, HPCwire, 2021 URL https://www.hpcwire.com/2021/05/05/lrz-announces-new-phase-of-supermuc-ng-supercomputer-with-intels-ponte-vecchio-gpu/.
Kwack J. H., Tramm J., Bertoni C., Ghadar Y., Homerding B., Rangel E., Knight C., Parker S.. “Evaluation of performance portability of applications and mini-apps across AMD, Intel and Nvidia GPUs”, 2021 International Workshop on Performance, Portability and Productivity in HPC (P3HPC) (14 November 2021, St. Louis, MO, USA), IEEE, 2021, ISBN 978-1-6654-2439-4, pp. 45–56.
HPE Cray Supercomputing EX, Hewlett Packard Enterprise Development LP, 2024 URL https://www.hpe.com/psnow/doc/a00094635enw.
Bertoni C., Parker S.. Aurora overvew, ALCF SDL Workshop (October 6, 2022), 2022, 20 pp.
Morgan T. P.. The NVSwitch Fabric That Is The Hub Of The DGX H100 SuperPOD, Stackhouse Publishing, 2022 URL https://www.nextplatform.com/2022/03/23/nvidia-will-be-a-prime-contractor-for-big-ai-supercomputers/.
Ishii A., Wells R.. “The Nvlink-Network switch: Nvidia's switch chip for high communication-bandwidth superpods”, 2022 IEEE Hot Chips 34 Symposium (HCS) (21–23 Aug., 2022, Cupertino, CA, USA), IEEE, 2022, ISBN 978-1-6654-6028-6, pp. 1–23.
Eassa A., Ishii A., Wells R.. Upgrading Multi-GPU Interconnectivity with the Third-Generation Nvidia NVSwitch, Nvidia developer, 2022 URL https://developer.nvidia.com/blog/upgrading-multi-gpu-interconnectivity-with-the-third-generation-nvidia-nvswitch.
BR100 series general purpose GPU chip, Biren Technology, Shanghai, 2023 URL https://www.birentech.com/BR10X.html.
Andersch M., Palmer G., Krashinsky R., Stam N., Mehta V., Brito G., Ramaswamy S.. Nvidia Hopper Architecture In-Depth, Nvidia developer, 2022 URL https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/.
Alcorn P.. From Opteron to Milan: Crusher Supercomputer Comes Online With New AMD CPUs and MI250X GPUs, Tom's Hardware, Future US, New York, 2022 URL https://www.tomshardware.com/news/from-opteron-to-milan-crusher-supercomputer-comes-online-with-amd-cpus-and-gpus.
Intel Xeon CPU Max series product overview, Intel, 2023 URL https://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.htmlAccessed 15.10.2023.
Accelerator Processor Stream, European Processor Initiative, 2022 URL https://www.european-processor-initiative.eu/accelerator/.
EPI EPAC1.0 RISC-V test chip samples delivered, European Processor Initiative, 2021, News URL https://www.european-processor-initiative.eu/epi-epac1-0-risc-v-test-chip-samples-delivered/.
Kovav {c} M., Notton P., Hofman D., Knezović J.. “How Europe is preparing its core solution for exascale machines and a global, sovereign, advanced computing platform”, Mathematical and Computational Applications, 25:3 (2020), pp. 46.
HIP Programming Guide, Version 5.0, 2023 URL https://docs.amd.com/bundle/HIP-Programming-Guide-v5.0/page/Programming_with_HIP.htmlAccessed 15.10.2023.
OpenMP Application Programming Interface, Version 5.2, OpenMP Architecture Review Board, 2021, 669 pp.
Khronos OpenCL Registry, Khronos Group, Formatted specifications and other related documentation URL https://registry.khronos.org/OpenCL/.
SYCL 2020 Specification, rev. 6, Khronos Group, 2022, 585 pp.
DPC++ Part 1: An Introduction to the New Programming Model, Intel URL https://www.intel.com/content/www/us/en/developer/videos/dpc-part-1-introduction-to-new-programming-model.html (Accessed 15.10.2023).
Bavarsad N. N., Makrani H. M., Sayadi H., Landis L., Rafatirad S., Homayoun H.. “HosNa: A DPC++ benchmark suite for heterogeneous architectures”, 2021 IEEE 39th International Conference on Computer Design (ICCD) (24–27 October 2021, Storrs, CT, USA), IEEE, 2021, ISBN 978-1-6654-3219-1, pp. 509–516.
Trott C., Berger-Vergiat L., Poliakoff D., Rajamanickam S., Lebrun-Grandie D., Madsen J., Al Awar N., Gligoric M., Shipman G., Womeldorff G.. “The Kokkos EcoSystem: comprehensive performance portability for high performance computing”, Computing in Science & Engineering, 23:5 (2021), pp. 10–18.
Trott C. R., Lebrun-Grandié D., Arndt D., Ciesko J., Dang V., Ellingwood N., Gayatri R., Harvey E., Hollman D. S., Ibanez D., Liber N., Madsen J., Miles J., Poliakoff D., Powell A., Rajamanickam S., Simberg M., Sunderland D., Turcksin B., Wilke J.. “Kokkos 3: Programming model extensions for the exascale era”, IEEE Transactions on Parallel and Distributed Systems, 33:4 (2021), pp. 805–817.
Moore S.. The state of the LAMMPS KOKKOS package, Sandia National Lab, Albuquerque, NM, 2021, SAND2021-9785C URL https://www.osti.gov/servlets/purl/1888676Accessed 15.10.2023.
Ghadar Y., Applencourt T., Homerding B., Harms K., Hammond J.. SYCL Programming Model for Aurora, 2020 ECP Annual Meeting, 2020.
Van Oostrum R., Chalmers N., Mc Dougall D., Bauman P., Curtis N., Malaya N., Wolfe N.. AMD GPU Hardware Basics, Frontier Application Readiness Kick-Off Workshop, 2019, 55 pp.
Intel oneAPI GPU Optimization Guide Release 2022.3, Intel URL https://www.intel.com/content/dam/develop/external/us/en/documents/oneapi-gpu-optimization-guide.pdf (Accessed 15.10.2023).
Khudia D., Huang J., Basu P., Deng S., Liu H., Park J., Smelyanskiy M.. Fbgemm: Enabling high-performance low-precision deep learning inference, 2021, 5 pp.
Carrasco R., Vega R., Navarro C. A.. “Analyzing GPU tensor core potential for fast reductions”, 2018 37th International Conference of the Chilean Computer Science Society (SCCC) (05–09 November 2018, Santiago, Chile), IEEE, 2018, ISBN 9781538692349, pp. 1–6.
Gupta G.. Using Tensor Cores for Mixed-Precision Scientific Computing, Nvidia developer, 2019 URL https://developer.nvidia.com/blog/tensor-cores-mixed-precision-scientific-computing/.
Nvidia A100 Tensor Core GPU Architecture, V1.0, Nvidia, 2020, 82 pp.
-2019 — IEEE Standard for Floating-Point Arithmetic, Revision of IEEE 754-2008, 2019, IEEE Std 754-2019, 84 pp.
Kalamkar D., Mudigere D., Mellempudi N., Das D., Banerjee K., Avancha S., Vooturi D. T., Jammalamadaka N., Huang J., Yuen H., Yang J., Park J., Heinecke A., Georganas E., Srinivasan S., Kundu A., Smelyanskiy M., Kaul B., Dubey P.. A study of BFLOAT16 for deep learning training, 2019, 10 pp.
Stosic D., Micikevicius P.. Accelerating AI Training with Nvidia TF32 Tensor Cores, Nvidia developer, 2021 URL https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/.
Micikevicius P., Stosic D., Burgess N., Cornea M., Dubey P., Grisenthwaite R., Ha S., Heinecke A., Judd P., Kamalu J., Mellempudi N., Oberman S., Shoeybi M., Siu M., Wu H.. Fp8 formats for deep learning, 2022, 9 pp.
Nvidia H100 Tensor Core GPU Architecture, V1.04, Nvidia, 2023, Includes final GPU / memory clocks and final TFLOPS performance specs, 71 pp.
Sun W., Li A., Geng T., Stuijk S., Corporaal H.. “Dissecting tensor cores via microbenchmarks: latency, throughput and numerical behaviors”, IEEE Transactions on Parallel and Distributed Systems, 34:1 (2022), pp. 246–261.
Lehmann M., Krause M. J., Amati G., Sega M., Harting J., Gekle S.. “Accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit, and customized 16-bit number formats”, Physical Review E, 106:1 (2022), 015308.
Domke J., Matsumura K., Wahib M., Zhang H., Yashima K., Tsuchikawa T., Tsuji Y., Podobas A., Matsuoka S.. “Double-precision FPUs in high-performance computing: an embarrassment of riches{? }” 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (20–24 May 2019, Rio de Janeiro, Brazil), IEEE, 2019, ISBN 978-1-7281-1246-6, pp. 78–88.
Schade R., Kenter T., Elgabarty H., Lass M., Schütt O., Lazzaro A., Pabst H., Mohr S., Hutter J., Kühne T. D., Plessl C.. “Towards electronic structure-based molecular dynamics simulations with hundreds of millions of atoms”, Parallel Computing, 111 (2022), 102920, 11 pp.
Schade R., Kenter T., Elgabarty H., Lass M., Kühne T. D., Plessl C.. Breaking the exascale barrier for the electronic structure problem in molecular dynamics, 2022, 6 pp.
Yu V. W., Govoni M.. GPU acceleration of large-scale full-frequency GW calculations, 2022, 54 pp.
Eriksen J. J.. “Efficient and portable acceleration of quantum chemical many-body methods in mixed floating point precision using OpenACC compiler directives”, Molecular Physics, 115:17–18 (2017), pp. 2086–2101.
Ruda D., Turek S., Ribbrock D., Zajac P.. Very fast FEM Poisson solvers on lower precision accelerator hardware, ECCOMAS Congress 2022 (5–9 June 2022, Oslo, Norway), 2022, 24 pp.
Ootomo H., Yokota R.. “Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance”, The International Journal of High Performance Computing Applications, 36:4 (2022), pp. 475–491.
Jain A., Sharma N.. “Accelerated AI inference at CNN-based machine vision in ASICs: A design approach”, ECS Transactions, 107:1 (2022), pp. 5165.
Gallet B., Gowanlock M.. Computing double precision Euclidean distances using GPU tensor cores, 2022, 10 pp.
Domke J., Vatai E., Drozd A., Chen P. T, Oyama Y., Zhang L., Salaria S., Mukunoki D., Podobas A., Wahib M. T, Matsuoka S.. “Matrix engines for high performance computing: A paragon of performance or grasping at straws?” 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (17–21 May 2021, Portland, OR, USA), IEEE, 2021, ISBN 978-1-6654-4066-0, pp. 1056–1065.
Tan H., Yan R., Yang L., Huang L., Xiao L., Yang Q.. “Efficient multiple-precision and mixed-precision floating-point fused multiply-accumulate unit for HPC and AI applications”, Algorithms and Architectures for Parallel Processing, 22nd International Conference ICA3PP 2022 (Copenhagen, Denmark, October 10–12, 2022), Lecture Notes in Computer Science, vol. 13777, Springer Nature Switzerland, Cham, 2023, ISBN 978-3-031-22676-2, pp. 642–659.
Эксклюзивное интервью с руководителями Biren Technology: деконструкция первого 7-нм графического процессора компании, 2022, Обзор от компании MooreElite.com (Hefei) URL https://caifuhao.eastmoney.com/news/20220812093829803631950 (Китайский).
Nvidia A100 Tensor Core GPU Datasheet, V1.0, Nvidia, 2020, 3 pp.
Choquette J., Lee E., Krashinsky R., Balan V., Khailany B.. “3.2 The A100 Datacenter GPU and Ampere Architecture”, 2021 IEEE International Solid-State Circuits Conference (ISSCC) (13–22 February 2021, San Francisco, CA, USA), IEEE, 2021, ISBN 9781728195506, pp. 48–50.
Nvidia A100 tensor core GPU architecture, V1.0, Nvidia, 2020, 82 pp.
Hassanpour M., Riera M., González A.. “A survey of near-data processing architectures for neural networks”, Machine Learning and Knowledge Extraction, 4:1 (2022), pp. 66–102.
Gómez-Luna J., Guo Y., Brocard S., Legriel J., Cimadomo R., Oliveira G. F., Singh G., Mutlu O.. An experimental evaluation of machine learning training on a real processing-in-memory system, 2022, 21 pp.
Niu D., Li S., Wang Y., Han W., Zhang Z., Guan Y., Guan T., Sun F., Xue F., Duan L., Fang Y., Zheng H., Jiang X., Wang S., Zuo F., Wang Y., Yu B., Ren Q., Xie Y.. “184QPS/W 64Mb/mm$^2$3D logic-to-DRAM hybrid bonding with process-near-memory engine for recommendation system”, IEEE International Solid-State Circuits Conference (ISSCC) (20–26 February 2022, San Francisco, CA, USA), IEEE, 2022, pp. 1–3.
BiLi 106M, Biren Technology, Shanghai, 2020–2023, Product details URL https://www.birentech.com/product_details/1005557637772464128.html.
BiLi 106B, 106C, Biren Technology, Shanghai, 2020–2023 URL https://www.birentech.com/product_details/1005557844745474048.html.
Blankenship R., Wagh M.. Introducing the CXL 3.1 Specification, Compute express link consortium, 2022, 27 pp.
Coughlin T.. “Digital storage and memory”, Computer, 55:1 (2022), pp. 20–29.
Nvidia A100 Tensor Core GPU Datasheet, Nvidia, 2021, 3 pp.
Ampere Tuning Guide, Release 12.4, Nvidia, 2024, 22 pp.
Server/OAI, Open computers project, Wiki page URL https://www.opencompute.org/wiki/Server/OAI.
Nvidia DGX A100, Nvidia, 2023, Datasheet, 2 pp.
Morgan T. P.. China launches the inevitable indigenous GPU, Stackhouse Publishing, 2022 URL https://www.nextplatform.com/2022/08/25/china-launches-the-inevitable-indigenous-gpu/.
BIRENSUPA software development platform, Biren Technology, Shanghai, 2023, Product details URL https://www.birentech.com/product_details/1005588957219246080.html.
MLPerf inference: datacenter benchmark suite results, MLCommons URL https://mlcommons.org/en/inference-datacenter-21/.
Reddi V. J., Cheng C., Kanter D., Mattson P., Schmuelling G., Carole-Wu J., Anderson B., Breughe M., Charlebois M., Chou W., Chukka R., Coleman C., Davis S., Deng P., Diamos G., Duke J., Fick D., Gardner J. S., Hubara I., Idgunji S., Jablin T. B., Jiao J., John T. S., Kanwar P., Lee D., Liao J., Lokhmotov A., Massa F., Meng P., Micikevicius P., Osborne C., Pekhimenko G., Rajan A. T. R., Sequeira D., Sirasao A., Sun F., Tang H., Thomson M., Wei F., Wu E., Xu L., Yamada K., Yu B., Yuan G., Zhong A., Zhang P., Zhou Y.. “Mlperf inference benchmark”, 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) (30 May 2020–03 June 2020, Valencia, Spain), IEEE, 2020, ISBN 978-1-7281-4661-4, pp. 446–459.
Saad M. H., Hashima S., Sayed W., El-Shazly E. H., Madian A. H., Fouda M. M.. “Early diagnosis of COVID-19 images using optimal CNN hyperparameters”, Diagnostics, 13:1 (2023), 76.
Devlin J., Ming-Chang W., Lee K., Toutanova K.. “BERT: Pre-training of deep bidirectional transformers for language understanding”, Human Language Technology: Conference of the North American Chapter of the Association of Computational Linguistics. 1, NAACL-HLT 2019 (June 2–June 7, 2019, Minneapolis, Minnesota, USA), ACL, 2019, ISBN 978-1-950737-13-0, pp. 4171–4186.
Nvidia TensorRT, an SDK for high-performance deep learning inference, Nvidia developer, Nvidia, Web site URL https://developer.nvidia.com/tensorrt.
Blythe D.. “The X$^e$ GPU architecture”, 2020 IEEE Hot Chips 32 Symposium (HCS) (16–18 August 2020, Palo Alto, CA, USA), IEEE, 2020, ISBN 978-1-7281-7129-6, pp. 1–27.
Blythe D.. “X$^e$HPC Ponte Vecchio”, 2021 IEEE Hot Chips 33 Symposium (HCS) (22–24 August 2021, Palo Alto, CA, USA), IEEE, 2021, ISBN 978-1-6654-1397-8, pp. 1–34.
Intel data center GPU Max series product brief, Intel URL https://www.intel.com/content/dam/www/central-libraries/us/en/documents/2023-01/data-center-gpu-max-series-product-brief.pdf (Accessed 15.10.2023).
Intel data center GPU flex series product brief, Intel URL https://www.intel.com/content/dam/www/central-libraries/us/en/documents/2022-08/ats-m-product-brief-final.pdf (Accessed 15.10.2023).
Dhote D., Virmani C., Krishna K. G., Raghav S.. “The science of ray tracing”, International Journal of Computer Applications, 176:42 (2020), pp. 15–20.
Intel data center GPU Max series, Intel URL https://ark.intel.com/content/www/us/en/ark/products/series/232874/intel-data-center-gpu-max-series.html (Accessed 15.10.2023).
Jiang H.. “Intel's Ponte Vecchio GPU: architecture, systems and software”, 2022 IEEE Hot Chips 34 Symposium (HCS) (21–23 August 2022, Cupertino, CA, USA), IEEE, 2022, ISBN 978-1-6654-6028-6, pp. 1–29.
Sidorova M., Gorbushin L., Koneva N.. “Analytical review of electronic devices of modern supercomputing systems”, Proceedings of the International Russian Automation Conference, RusAutoCon2021 (September 5-11, 2021, Sochi, Russia), Lecture Notes in Electrical Engineering, vol. 857, Springer, Cham, 2022, ISBN 978-3-030-94201-4, pp. 25–33.
Tian W., Li B., Li Z., Cui H., Shi J., Wang Y., Zhao J.. “Using chiplet encapsulation technology to achieve processing-in-memory functions”, Micromachines, 13:10 (2022), pp. 1790.
Moore S. K.. “3 paths to 3D processors”, IEEE Spectrum, 59:6 (2022), pp. 24–29.
Zhang S., Li Z., Zhou H., Li R., Wang S., Kyung-Paik W., He P.,. “Recent prospectives and challenges of 3D heterogeneous integration”, e-Prime-Advances in Electrical Engineering, Electronics and Energy, 2022, 100052.
Hadidi R., Asgari B., Mudassar B. A., Mukhopadhyay S., Yalamanchili S., Kim H.. “Demystifying the characteristics of 3D-stacked memories: A case study for Hybrid Memory Cube”, 2017 IEEE international symposium on Workload characterization (IISWC) (01–03 October 2017, Seattle, WA, USA), IEEE, 2017, pp. 66–75.
Ma X., Wang Y., Wang Y., Cai X., Han Y.. “Survey on chiplets: interface, interconnect and integration methodology”, CCF Transactions on High Performance Computing, 2022, no.4, pp. 43–52.
Universal chiplet interconnect express specifications, Universal Chiplet Interconnect Express, 2023 URL https://www.uciexpress.org/specification.
Gomes W., Koker A., Stover P., Ingerly D., Siers S., Venkataraman S., Pelto C., Shah T., Rao A., O' .,Mahony, Karl E., Cheney L., Rajwani I., Jain H., Cortez R., Chandrasekhar A., Kanthi B., Koduri R.. “Ponte Vecchio: A multi-tile 3D stacked processor for exascale computing”, 2022 IEEE International Solid-State Circuits Conference (ISSCC) (20–26 February 2022, San Francisco, CA, USA), IEEE, 2022, ISBN 978-1-6654-2800-2, pp. 42–44.
Gomes W., Koker A., Stover P., Ingerly D., Siers S., Venkataraman S., Pelto C., Shah T., Rao A., O'Mahony F., Karl E., Cheney L., Rajwani I., Jain H., Cortez R., Chandrasekhar A., Kanthi B., Koduri R.. Ponte Vecchio: A multi-tile 3D stacked processor for exascale computing, HPC user forum, Accelerated Computing Systems and Graphics Group, 2021 URL https://www.hpcuserforum.com/wp-content/uploads/2021/05/Gomes_Intel_Ponte-Vecchio_Mar2022-HPC-UF.pdf.
Intel data center GPU Max series technical overview, Intel, 2023 URL https://www.intel.com/content/www/us/en/developer/articles/technical/intel-data-center-gpu-max-series-overview.html (Accessed 15.10.2023).
Moore S. K.. Behind Intel's HPC chip that will pierce the exascale barrier, IEEE Spectrum, IEEE, 2022, Blog URL https://spectrum.ieee.org/intel-s-exascale-supercomputer-chip-is-a-master-class-in-3d-integration.
Ingerly D. B., Amin S., Aryasomayajula L., Balankutty A., Borst D., Chandra A., Cheemalapati K., Cook C. S., Criss R., Enamul K., Gomes W., Jones D., Kolluru K. C., Kandas A., G.-Kim S., Ma H., Pantuso D., Petersburg C. F., Phen-givoni M., Pillai A. M., Sairam A., Shekhar P., Sinha P., Stover P., Telang A., Zell Z.. “Foveros: 3D integration and the use of face-to-face chip stacking for logic devices”, 2019 IEEE International Electron Devices Meeting (IEDM) (07–11 December 2019, San Francisco, CA, USA), IEEE, 2019, ISBN 978-1-7281-4033-9, pp. 19.6.1-19.6.4.
Mahajan R., Sankman R., Patel N., Dae-Kim W., Aygun K., Qian Z., Mekonnen Y., Salama I., Sharan S., Iyengar D., Mallik D.. “Embedded multi-die interconnect bridge (EMIB)–a high density, high bandwidth packaging interconnect”, 2016 IEEE 66th Electronic Components and Technology Conference (ECTC) (31 May 2016–03 June 2016, Las Vegas, NV, USA), IEEE, 2016, pp. 557–565.
Irani S.. Hang SK Intel Ponte Vecchio compute accelerator OAM product and system, 2021 OCP Global Summit, 2021 URL https://www.opencompute.org/events/past-events/2021-ocp-global-summit.
Tekin A., A.Durak T., Piechurski C., Kaliszan D.,Sungur F. A., Robertsén F., Gschwandtn P.. State-of-the-art and trends for computing and interconnect network solutions for HPC and AI, PRACE, 2021, Partnership for Advanced Computing in Europe, 38 pp.
Sun W., Li A., Geng T., Stuijk S., Corporaal H.. “Dissecting tensor cores via microbenchmarks: latency, throughput and numerical behaviors”, IEEE Transactions on Parallel and Distributed Systems, 34:1 (2023), pp. 246–261.
Intel Products formerly Alchemist, Intel URL https://ark.intel.com/content/www/us/en/ark/products/codename/226095/products-formerly-alchemist.html (Accessed 15.10.2023).
Watts D.. Lenovo ThinkSystem and ThinkAgile GPU Summary, Lenovo press, 2024, Product Guide, 71 pp.
Liu Zh.. Intel Axes Data Center GPU Max 1350, Preps New Max 1450 for 'Different Markets', Tom's Hardware, Future US, New York, 2023 URL https://www.tomshardware.com/news/intel-axes-data-center-gpu-max-1350-preps-max-1450-for-different-markets.
Vuduc R., Chandramowlishwaran A., Choi J., Guney M.(E.), Shringarpure A.. “On the limits of GPU acceleration”, Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar'10 (June 14–15, 2010, Berkeley, CA, USA), USENIX Association, Berkeley, 2010, 6 pp.
Hanindhito B., Gourounas D., Fathi A., Trenev D., Gerstlauer A., John L. K.. “GAPS: GPU-acceleration of PDE solvers for wave simulation”, ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing (June 28–30, 2022, Virtual Event), ACM, New York, 2022, ISBN 978-1-4503-9281-5, 13 pp.
Chalmers N., Mishra A., McDougall D., Warburton T.. “HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark”, The International Journal of High Performance Computing Applications, 37:5 (2023), pp. 560–577.
Philippe J.-L.. Intel HW roadmap and architecture specifics, OneAPI workshop with FocusCoE, 2022, 48 pp.
Min M., Lan Y.-H., Fischer P., Rathnayake T., Holmen J.. Nek5000/RS Performance on Advanced GPU Architectures, Argonne National Lab. (ANL), Argonne, IL, 2022, ANL-22/81, 30 pp.
oneAPI GPU Optimization Guide, edition 2023.1, Intel, 2023, 411 pp.
Blythe D.. “XeHPC Ponte Vecchio”, 2021 IEEE Hot Chips 33 Symposium (HCS), 2021, pp. 1–34.
van der Steen A. J.. “Overview of recent supercomputers”, in: Dongarra J. J., van der Steen A. J., “High-performance computing systems: Status and outlook”, Acta Numerica, 21 (2012), pp. 379–474 URL https://www.researchgate.net/publication/259421001_High-performance_computing_systems_Status_and_outlook/link/54f467380cf2f9e34f0a2083/download?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InB1YmxpY2F0aW9uIiwicGFnZSI6InB1YmxpY2F0aW9uIn19.
Intel Xeon CPU Max Series, Intel, 2023, Product Brief, 3 pp.
Shipman G. M., Swaminarayan S., Grider G., Lujan J., Zerr R. J.. Early performance results on 4th Gen Intel Xeon Scalable processors with DDR and Intel Xeon processors, codenamed Sapphire Rapids with HBM, 2022, 5 pp.
SiPearl: collaboration with Intel to accelerate exascale supercomputing deployment in Europe, The Silicon Pearl, Press release, 2 pp.
Parker S.. Future ALCF Systems, 2021 ALCF Computational Performance Workshop, Argonne National Laboratory, 2021, 25 pp.
Ghadar Y., Williams T.. An Overview of Aurora, Argonne's Upcoming Exascale System, ALCF Developer Session (December 11 2019), 2020, 45 pp.
SYCL, Khronos Group, The SYCL main page URL https://www.khronos.org/sycl/.
Castaño G., Faqir-Rhazoui Y., García C., Prieto-Matías M.. “Evaluation of Intel's DPC++ Compatibility Tool in heterogeneous computing”, Journal of Parallel and Distributed Computing, 165 (2022), pp. 120–129.
Intel oneAPI 2023 Release: Preview the Tools, Intel URL https://www.intel.com/content/www/us/en/developer/videos/intel-oneapi-2023-release-preview.html (Accessed 15.10.2023).
Intel oneAPI Plug-Ins from Codeplay for Nvidia and AMD GPUs, Intel URL https://www.intel.com/content/www/us/en/developer/videos/oneapi-plug-ins-codeplay-nvidia-amd-gpus.html (Accessed 15.10.2023).
oneAPI DPC++ Compiler documentation, Intel, 2024, LLVM documentation URL https://intel.github.io/llvm-docs/.
Benchmarking the performance of oneAPI on heterogeneous computing Platforms, Moasys, Intel Software, 2021, webinar slides, 30 pp.
OneAPI Specifications, Unified Acceleration Foundation, 2024 URL https://www.oneapi.io/spec/.
Wang Z., Plyakhin Y., Sun C., Zhang Z., Jiang Z., Huang A., Wang H.. “A source-to-source CUDA to SYCL code migration tool: Intel DPC++ Compatibility Tool”, IWOCL '22: Proceedings of the 10th International Workshop on OpenCL (May 10–12, 2022, Bristol, United Kingdom), ACM, New York, 2022, ISBN 978-1-4503-9658-5, 2 pp.
Fortenberry A., Tomov S.. “Extending MAGMA portability with OneAPI”, 2022 Workshop on Accelerator Programming Using Directives (WACCPD) (13–18 November, 2022, Dallas, TX, USA), IEEE, 2022, ISBN 978-1-6654-9019-1, pp. 22–31.
Hardy D. J., Choi J., Jiang W., Tajkhorshid E.. “Experiences porting NAMD to the Data Parallel C++ programming model”, IWOCL '22: Proceedings of the 10th International Workshop on OpenCL (May 10–12, 2022, Bristol, United Kingdom), ACM, New York, 2022, ISBN 978-1-4503-9658-5, 5 pp.
Alekseenko A., Páll S., Lindahl E.. “Experiences with adding SYCL support to GROMACS”, IWOCL '21: Proceedings of the 9th International Workshop on OpenCL (April 27–29, 2021, Munich, Germany), ACM, New York, pp. 1.
GROMACS Highlights, GROMACS development team, 2023 URL https://manual.gromacs.org/current/release-notes/2023/major/highlights.html.
Alpay A., Soproni B., Wünsche H., Heuveline V.. “Exploring the possibility of a hipSYCL-based implementation of oneAPI”, IWOCL '22: Proceedings of the 10th International Workshop on OpenCL (May 10–12, 2022, Bristol, United Kingdom), ACM, New York, 2022, ISBN 978-1-4503-9658-5, 12 pp.
Sakiotis I., Arumugam K., Paterno M., Ranjan D., Terzić B., Zubair M., High Performance Computing, 38th International Conference ISC High Performance 2023 (May 21–25, 2023, Hamburg, Germany), Lecture Notes in Computer Science, vol. 13948, Springer, Cham, 2023, ISBN 978-3-031-32040-8, pp. 339–358.
Reguly I. Z., Owenson A. M. B., Powell A., Jarvis S. A., Mudalige G. R.. “Under the hood of SYCL — an initial performance analysis with an unstructured-mesh CFD application”, High Performance Computing, 36th International Conference ISC High Performance 2021 (June 24–July 2, 2021, Virtual Event), Lecture Notes in Computer Science, vol. 12728, Springer, Cham, 2021, ISBN 978-3-030-78712-7, pp. 391–410.
Walden A. C., Zubair M., Nielsen E. J.. “Performance and portability of a linear solver across emerging architectures”, Accelerator Programming Using Directives, 7th International Workshop WACCPD 2020 (November 20, 2020, Virtual Event), Lecture Notes in Computer Science, vol. 12655, Springer, Cham, 2021, ISBN 978-3-030-74223-2, pp. 61–79.
Zubair M., Stone C., Walden A., Nielsen E.. Experiences in Moving CUDA-Optimized Kernels to Intel GPUs using oneAPI, SC21, 2021, 22 pp.
Nvidia HPC Compilers User's Guide, version 2023, DU-09862-001-V2023, 173 pp.
Karp M., Massaro D., Jansson N., Hart A., Wahlgren J., Schlatter P., Markidis S.. “Large-scale direct numerical simulations of turbulence using GPUs and modern Fortran”, The International Journal of High Performance Computing Applications, 37:5 (2023), pp. 487–502.
OpenMP Compilers & Tools, OpenMP, 2023 URL https://www.openmp.org/resources/openmp-compilers-tools/.
Cojean T., Tsai Y. H. M., Anzt H.. “Ginkgo—A math library designed for platform portability”, Parallel Computing, 111 (2022), 102902.
Fuentes J., López D., González S.. “Teaching heterogeneous computing using DPC++”, 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (30 May 2022–03 June 2022, Lyon, France), IEEE, 2022, ISBN 978-1-6654-9747-3, pp. 354–360.
Fridman Y., Tamir G., Oren G.. “Portability and scalability of OpenMP offloading on state-of-the-art accelerators”, High Performance Computing, ISC High Performance 2023 International Workshops (May 21–25, 2023, Hamburg, Germany), Lecture Notes in Computer Science, vol. 13999, Springer, Cham, 2023, ISBN 978-3-031-40842-7, pp. 378–390.
Reguly I. Z.. “Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications”, Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, SC-W'23 (November 12–17, 2023, Denver, CO, USA), ACM, New York, 2023, ISBN 979-8-4007-0785-8, pp. 1038–1047.
SPEC CPU2017 Results, Standard Performance Evaluation Corporation URL https://www.spec.org/cpu2017/results.
Solis-Vasquez L., Mascarenhas E., Koch A.. “Experiences migrating CUDA to SYCL: A molecular docking case study”, Proceedings of the 2023 International Workshop on OpenCL, IWOCL'23 (April 18–20, 2023, Cambridge, United Kingdom), ACM, New York, 2023, ISBN 979-8-4007-0745-2, pp. 1–11.
Nguyen P., Nayak P., Anzt H.. Porting batched iterative solvers onto Intel GPUs with SYCL, SC-W '23 (November 12–17, 2023, Denver, CO, USA), ACM, New York, 2023, ISBN 979-8-4007-0785-8.
Morgan T. P.. One New Feature For Intel's HPC Compute Engines: Contrition, The Next Platform, Stackhouse Publishing, 2022 URL https://www.nextplatform.com/2022/11/09/one-new-feature-for-intels-hpc-compute-engines-contrition/.
Morgan T. P.. Aurora In A Socket: What Intel's Falcon Shores XPU Might Do, The Next Platform, Stackhouse Publishing, 2022 URL https://www.nextplatform.com/2022/02/28/aurora-in-a-socket-what-intels-falcon-shores-xpu-might-do/.
Nvidia Parallel Thread Execution ISA, Release 8.4, Nvidia, 2024, 598 pp.
Inline PTX Assembly in CUDA, vol. 1, Release 12.4, Nvidia, 2024, 16 pp.
Nvidia Ampere GA102 GPU Architecture, V2.0, Nvidia, 2021, Updated with Nvidia RTX A6000 and Nvidia A40 Information, 53 pp.
GPU Specs Database, TechPowerUp, A reference list of most graphics cards released in recent years URL https://www.techpowerup.com/gpu-specs/.
Osama M., Merrill D., Cecka C., Garland M., Owens J. D.. Stream-K: Work-centric parallel decomposition for dense matrix-matrix multiplication on the GPU, PPoPP'23 (25 February 2023–1 March 2023, Montreal, QC, Canada), ACM, New York, 2023, ISBN 979-8-4007-0015-6.
Nsight Compute, v2024.1.1, Nvidia, The User Guide for Nsight Compute URL https://docs.nvidia.com/nsight-compute/NsightCompute/.
Li A., Song S. L., Wijtvliet M., Kumar A., Corporaal H.. “SFU-driven transparent approximation acceleration on GPUs”, Proceedings of the 2016 International Conference on Supercomputing, ICS'16 (June 1–3, 2016, Istanbul, Turkey), ACM, New York, 2016, ISBN 978-1-4503-4361-9, pp. 1–14.
Jia Z., Maggioni M., Staiger B., Scarpazza D. P.. Dissecting the Nvidia Volta GPU architecture via microbenchmarking, 2018, 66 pp.
Choquette J., Gandhi W., Giroux O., Stam N., Krashinsky R.. “Nvidia A100 tensor core GPU: performance and innovation”, IEEE Micro, 41:2 (2021), pp. 29–35.
Multi-Instance GPU User Guide, RN-08625-v2.0, Nvidia, 2024, iv+53 pp.
Nvidia A100 80GB PCIe GPU, PB-10577-001_v03, Nvidia, 2022, Product Brief URL https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/PB-10577-001_v02.pdf.
Li A., Song S. L., Chen J., Li J., Liu X., Tallent N. R., Barker K. J.. “Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect”, IEEE Transactions on Parallel and Distributed Systems, 31:1 (2019), pp. 94–110.
Lutz C., Breß S., Zeuch S., Rabl T., Markl V.. “Pump up the volume: Processing large data on GPUs with fast interconnects”, Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD'20 (June 14–19, 2020, Portland, OR, USA), ACM, New York, 2020, ISBN 978-1-4503-6735-6, pp. 1633–1649.
Nvidia NVSwitch, Nvidia, 2018, Technical Overview, 8 pp.
Špeťko M., Vysocký O., Jansík B., Říha L.. “DGX-A100 Face to Face DGX-2 – performance, power and thermal behavior evaluation”, Energies, 14:2 (2021), 376, 18 pp.
Choi Y. R., Nikolskiy V., Stegailov V.. “Matrix-matrix multiplication using multiple GPUs connected by NVLink”, 2020 Global Smart Industry Conference (GloSIC) (17–19 November 2020, Chelyabinsk, Russia), IEEE, 2020, ISBN 9781728180755, pp. 354–361.
Manathunga M., Jin C., Cruzeiro V. W. D., Miao Y., Mu D., Arumugam K., Keipert K., Aktulga H. M., Merz K. M. Jr., Götz A. W.. “Harnessing the power of multi-GPU acceleration into the quantum interaction computational kernel program”, Journal of Chemical Theory and Computation, 17:7 (2021), pp. 3955–3966.
Choi Y. R., Stegailov V.. “Multi-GPU GEMM algorithm performance analysis for Nvidia and AMD GPUs connected by NVLink and PCIe”, Mathematical Modeling and Supercomputer Technologies: 22nd International Conference, Revised Selected Papers, 22nd International Conference MMST 2022 (November 14–17, 2022, Nizhny Novgorod, Russia), Springer, Cham, 2022, ISBN 978-3-031-24144-4, pp. 281–292.
Nvidia DGX A100, Nvidia, 2023, Datasheet, 2 pp.
Nvidia DGX A100, DU-09821-001_v01, Nvidia, 2023, User Guide, 126 pp.
Nvidia DGX Platform, Cloud & Data Center, Nvidia, 2024 URL https://www.nvidia.com/en-us/data-center/dgx-platform/.
Nvidia DGX SuperPOD: Scalable Infrastructure for AI Leadership, RA-09950-001, Nvidia, 2021, Reference Architecture, 30 pp.
Leonardo HPC System, Leonardo Pre-exascale Supercomputer, 2024, Technical info URL https://leonardo-supercomputer.cineca.eu/hpc-system/.
Perlmutter Architecture, NERSC, NERSC documentation URL https://docs.nersc.gov/systems/perlmutter/architecture/.
Nvidia HPC SDK, Containers, Nvidia, Overview URL https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc.
Nvidia HPC SDK Version 24.3 Documentation, Nvidia, 2024 URL https://docs.nvidia.com/hpc-sdk/.
CUDA LLVM Compiler, Nvidia Developer, Nvidia URL https://developer.nvidia.com/cuda-llvm-compiler.
Potluri S., Hamidouche K., Venkatesh A., Bureddy D., Panda K.. “Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with Nvidia GPUs”, 2013 42nd International Conference on Parallel Processing (01–04 October 2013, Lyon, France), IEEE, 2013, ISBN 978-0-7695-5117-3, pp. 80–89.
Nvidia CUDA Fortran programming guide, HPC SDK documentation, version 24.3, Nvidia, 2024 URL https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/.
Ruetsch G., Fatica M.. CUDA Fortran for scientists and engineers. Best practices for efficient CUDA Fortran programming, 1st ed., Morgan Kaufmann, 2013, ISBN 978-0-12-416970-8, 338 pp.
Oyanagi S.. HPE Cray MPI update, SC'21 ANL MPICH BOF (November 17, 2021), Slides, 6 pp.
Developing a Linux kernel module using GPUDirect RDMA, v12.4, Nvidia, 2024, 48 pp.
MVAPICH: MPI over InfiniBand, Omni-Path, Ethernet/iWARP, RoCE, and Slingshot, Version 2.3.7, ed. Panda D. K., NBCL, 2024, User guide URL https://mvapich.cse.ohio-state.edu/userguide/gdr/.
Bhalachandra S., Austin B., Williams S., Wright N. J.. Understanding the impact of input entropy on FPU, CPU, and GPU power, 2022, 6 pp.
Nvidia Docs. Matrix multiplication background, Deep learning, DU-09799-001_v001, Nvidia, 2023, User's Guide, 17 pp.
Mukunoki D., Ozaki K., Ogita T., Imamura T.. “DGEMM using tensor cores, and its accurate and reproducible versions”, High Performance Computing, 35th International Conference, ISC High Performance 2020 (June 22–25, 2020, Frankfurt/Main, Germany), Lecture Notes in Computer Science, vol. 12151, Springer, Cham, 2020, ISBN 978-3-030-50742-8, pp. 230–248.
Fasi M., Higham N. J., Mikaitis M., Pranesh S.. “Numerical behavior of Nvidia tensor cores”, PeerJ Computer Science, 7 (2021), e330.
Fasi M., Higham N. J., Lopez F., Mary T., Mikaitis M.. “Matrix multiplication in multiword arithmetic: error analysis and application to GPU tensor cores”, SIAM Journal on Scientific Computing, 45:1 (2023), pp. C1–C19.
Li S., Osawa K., Hoefler T.. Efficient quantized sparse matrix operations on tensor cores, 2022, 13 pp.
Tao D., Tian J.. Performance of Sample CUDA Benchmarks on Nvidia Ampere A100 vs Tesla V100, GitHub Inc., 2021 URL https://github.com/dingwentao/CUDA-benchmark-performance-on-A100.
CUDA Samples, TRM-06704-001_v11.2, Reference Manual, 142 pp.
All ACCEL Results Published by SPEC, Standard Performance Evaluation Corporation, 2024 URL http://spec.org/accel/results/accel_acc.html.
Brunst H., Chandrasekaran S., Ciorba F., Hagerty N., Henschel R., Juckeland G., Li J., Vergara V. G. M., Wienke S., Zavala M.. First experiences in performance benchmarking with the new SPEChpc 2021 suites (16-19 May 2022, Taormina, Italy), 2022, ISBN 978-1-6654-9956-9.
All HPC2021 Results Published by SPEC, SPEC, 2024 URL http://spec.org/hpc2021/results/hpc2021.html.
Svedin M., Chien S. W. D., Chikafa G., Jansson N., Podobas A.. “Benchmarking the Nvidia GPU lineage: From early K80 to modern A100 with asynchronous memory transfers”, Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (June 21–23, 2021, Online, Germany), ACM, New York, 2021, ISBN 978-1-4503-8549-7, 6 pp.
Zhang L., Wahib M., Chen P., Meng J., Wang X., Endo T., Matsuoka S.. Persistent kernels for iterative memory-bound GPU applications, 2022.
Špeťko M., Vysocký O., Jansík B., Říha L.. “DGX-A100 face to face DGX-2 – performance, power and thermal behavior evaluation”, Energies, 14:2 (2021), 376.
Jansik B.. Mandelbrot benchmark URL https://code.it4i.cz/jansik/mandelbrot (Accessed 15.10.2023).
Mudigere D., Hao Y., Huang J., Jia Z., Tulloch A., Sridharan S., Liu X., Ozdal M., Nie J., Park J., Luo L., Yang J. A., Gao L., Ivchenko D., Basant A., Hu Y., Yang J., Ardestani E. K., Wang X., Komuravelli R., Chu C.-H., Yilmaz S., Li H., Qian J., Feng Z., Ma Y., Yang J., Wen E., Li H., Yang L., Sun C., Zhao W., Melts D., Dhulipala K., Kishore K. R., Graf T., Eisenman A., Matam K. K., Gangidi A., Chen G. J., Krishnan M., Nayak A., Nair K., Muthiah B., Khorashadi M., Bhattacharya P., Lapukhov P., Naumov M., Mathews A., Qiao L., Smelyanskiy M., Jia B., Rao V.. “Software-hardware co-design for fast and scalable training of deep learning recommendation models”, Proceedings of the 49th Annual International Symposium on Computer Architecture, ISCA '22 (June 18–22, 2022, New York, USA), ACM, New York, 2022, ISBN 978-1-4503-8610-4, pp. 993–1011.
Deakin T., Price J., Martineau M., McIntosh-Smith S.. “Evaluating attainable memory bandwidth of parallel programming models via BabelStream”, International Journal of Computational Science and Engineering, 17:3 (2018), pp. 247–262.
McCalpin J. D.. Memory bandwidth and machine balance in current high performance computers, 1995, IEEE computer society technical committee on computer architecture (TCCA) newsletter, 7 pp.
Tsai Y. M., Cojean T., Anzt H.. “Evaluating the performance of Nvidia's A100 Ampere GPU for sparse and batched computations”, 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) (12 November 2020, GA, USA), IEEE, 2020, ISBN 978-0-7381-1048-6, pp. 26–38.
Tsai Y. M., Cojean T., Anzt H.. Evaluating the performance of Nvidia's A100 Ampere GPU for sparse linear algebra computations, 2020, 9 pp.
Hammond J. R., Deakin T., Cownie J., McIntosh-Smith S.. “Benchmarking Fortran DO CONCURRENT on CPUs and GPUs using BabelStream”, 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) (13-18 November 2022, Dallas, TX, USA), IEEE, 2022, ISBN 978-1-6654-5185-7, pp. 82–99.
Balos C. J.. “Reproduced computational results report for ‘Ginkgo: a modern linear operator algebra framework for high performance computing’”, ACM Transactions on Mathematical Software (TOMS), 48:1 (2022), 3, 7 pp.
Grützmacher T., Anzt H., Quintana-Ortí E. S.. “Using Ginkgo's memory accessor for improving the accuracy of memory-bound low precision BLAS”, Software: Practice and Experience, 53:1, Special Issue: New Trends in High-Performance Computing: Software Systems and Applications (2021), pp. 81–98.
Mixbench, Openbenchmarking.org, Phoronix Media, 2024, A benchmark suite for GPUs on mixed operational intensity kernels URL https://openbenchmarking.org/test/pts/mixbench.
Dong T., Haidar A., Luszczek P., Tomov S., Abdelfattah A., Dongarra J.. “MAGMA batched: A batched BLAS approach for small matrix factorizations and applications on GPUs”, Journal of Latex class files, 14:8 (2015) (Accessed 15.10.2023).
Abdelfattah A., Tomov S., Dongarra J.. “Batch QR factorization on GPUs: design, optimization, and tuning”, Computational Science – ICCS 2022 (June 21–23, 2022, London, UK), Lecture Notes in Computer Science, vol. 13350, Springer, Cham, 2022, ISBN 978-3-031-08750-9, pp. 60–74.
Abdelfattah A., Barra V., Beams N., Bleile R., Brown J., Camier J.-S., Carson R., Chalmers N., Dobrev V., Dudouit Y., Fischer P., Karakus A., Kerkemeier S., Kolev T., Lan Y.-H., Merzari E., Min M., Phillips M., Rathnayake T., Rieben R., Stitt T., Tomboulides A., Tomov S., Tomov V., Vargas A., Warburton T., Weiss K.. “GPU algorithms for efficient exascale discretizations”, Parallel Computing, 108 (2021), 102841, 10 pp.
Dong T., Dobrev V., Kolev T., Rieben R., Tomov S., Dongarra J.. “A step towards energy efficient computing: Redesigning a hydrodynamic application on CPU-GPU”, 2014 IEEE 28th International Parallel and Distributed Processing Symposium (19–23 May 2014, Phoenix, AZ, USA), IEEE, 2014, ISBN 978-1-4799-3800-1, pp. 972–981.
Abdelfattah A., Baboulin M., Dobrev V., Dongarra J., Earl C., Falcou J., Haidar A., Karlin I., Kolev T., Masliah I., Tomov S.. “High-performance tensor contractions for GPUs”, Procedia Computer Science, 80 (2016), pp. 108–118.
Heinecke A., Henry G., Hutchinson M., Pabst H.. “LIBXSMM: accelerating small matrix multiplications by runtime code generation”, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC'16 (13–18 November 2016, Salt Lake City, UT), IEEE, 2016, ISBN 978-1-4673-8815-3, pp. 981–991.
Bethune I., Reid F., Lazzaro A.. CP2K Performance from Cray XT3 to XC30, Cray User Group (CUG), 2014, 11 pp.
Sedova A., Tharrington A., Messer B.. Portability in scientific computing: The molecular dynamics non-bonded forces calculation as a case study, 2018, 25 pp.
Waugh H., McIntosh-Smith S.. “On the use of BLAS libraries in modern scientific codes at scale”, Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI, 17th Smoky Mountains Computational Sciences and Engineering Conference SMC 2020 (August 26–28, 2020, Oak Ridge, TN, USA), Communications in Computer and Information Science, vol. 1315, Springer, Cham, 2020, ISBN 978-3-030-63392-9, pp. 67–79.
Mijić N., Davidović D.. “Batched matrix operations on distributed GPU's with application in theoretical physics”, 2022 45th Jubilee International Convention on Information, Communication and Electronic Technology (MIPRO) (23–27 May 2022, Opatija, Croatia), IEEE, 2022, ISBN 978-953-233-103-5, pp. 293–299.
Stylianou C., Weiland M.. Optimizing sparse linear algebra through automatic format selection and machine learning, 2023, 10 pp.
Pascuzzi V. R., Goli M.. “Benchmarking a proof-of-concept performance portable SYCL-based fast Fourier transformation Library”, Proceedings of the 10th International Workshop on OpenCL, International Workshop on OpenCL IWOCL'22 (May 10–12, 2022, Bristol, United Kingdom), ACM, New York, 2022, ISBN 978-1-4503-9658-5, 9 pp.
Tolmachev D.. “VkFFT-a performant, cross-platform and open-source GPU FFT library”, IEEE Access, 11 (2023), pp. 12039–12058.
Li B., Cheng S., Lin J.. tcFFT: Accelerating half-precision FFT through tensor cores, 2021, 10 pp.
Hagerty N., Melesse Vergara V., Tharrington A.. “Studying performance portability of LAMMPS across diverse GPU-based platforms”, S2 World 2020. CUG 2021 & 2022. PN_HCP. HeteroPar 2022, Concurrency and Computation: Practice and Experience, 35:28 (2023), e7895.
Poenaru A., Lin W.-C., McIntosh-Smith S.. “A performance analysis of modern parallel programming models using a compute-bound application”, ISC High Performance 2021: High Performance Computing, Lecture Notes in Computer Science, vol. 12728, Springer, Cham, 2021, ISBN 978-3-030-78712-7, pp. 332–350.
Solis-Vasquez L., Tillack A. F., Santos-Martins D., Koch A., Le Grand S., Forli S.. “Benchmarking the performance of irregular computations in AutoDock-GPU molecular docking”, Parallel Computing, 109 (2022), 102861, 12 pp.
Manathunga M., Aktulga H. M., Götz A. W., Merz K. M.. “Quantum mechanics/molecular mechanics simulations on Nvidia and AMD graphics processing units”, J. Chem. Inf. Model., 63:3 (2023), pp. 711–717.
Cruzeiro V. W. D., Manathunga M., Merz K. M., Götz A. W.. “Open-source multi-GPU-accelerated QM/MM simulations with AMBER and QUICK”, J. Chem. Inf. Model., 61:5 (2021), pp. 2109–2115.
Shajan A., Manathunga M., Götz A. W., Merz K. M. Jr.. “Geometry optimization: A comparison of different open-source geometry optimizers”, J. Chem. Theory Comput., 19:21 (2023), pp. 7533–7541.
Williams-Young D. B., Asadchev A., Popovici D. T., Clark D., Waldrop J., Windus T., Valeev E. F., de Jong W. A.. “Distributed memory, GPU accelerated Fock construction for hybrid, Gaussian basis density functional theory”, The Journal of Chemical Physics, 158:23 (2023), 234104.
Kim I., Jeong D., Son W.-J., Kim H.-J., Rhee Y. M., Jung Y., Choi H., Yim J., Jang I., Kim D. S.. “Kohn-Sham time-dependent density functional theory on the massively parallel GPUs”, npj Computational Materials, 9 (2023), 81, 12 pp.
Siegel A., Draeger E., Deslippe J., Evans T. M., Francois M., Germann T., Martin D., Hart W.. Application Results on Early Exascale Hardware, Oak Ridge National Lab.(ORNL), Oak Ridge, TN (US), 2022, No. ORNL/TM-2022/2437 URL https://info.ornl.gov/sites/publications/Files/Pub176277.pdf.
Dang G., Liu S., Guo T., Dang J., Li X.. “Direct numerical simulation of compressible turbulence accelerated by graphics processing unit: An open-source high accuracy accelerated computational fluid dynamic software”, Physics of Fluids, 34:12 (2022), 126106.
Min M., Tomboulides A.. Simulating Atmospheric Boundary Layer Turbulence with Nek5000/RS, Argonne National Lab. (ANL), Argonne, IL, 2022, No. ANL-22/79, 34 pp.
Fischer P., Kerkemeier S., Min M., Lan Y.-H., Phillips M., Rathnayake T., Merzari E., Tomboulides A., Karakus A., Chalmers N., Warburton T.. “NekRS, a GPU-accelerated spectral element Navier–Stokes solver”, Parallel Computing, 114 (2022), 102982, 13 pp.
FUN3D is a Computational Fluid Dynamics (CFD) suite of tools actively developed at NASA that benefits Aeronautics, Space Technology, and Exploration by modeling fluid flow, NASA Official: David P. Lockard, 14.0.2-16d1333 URL https://fun3d.larc.nasa.gov.
Nastac G., Walden A., Nielsen E. J., Frendi K.. “Implicit thermochemical nonequilibrium flow simulations on unstructured grids using GPUs”, AIAA Scitech 2021 Forum (11–15, 19-21 January 2021, virtual event), 2021, AIAA 2021-0159.
Nastac G., Walden A., Wang L., Nielsen E. J., Liu Y., Opgenorth M., Orender J., Zubair M.. “A multi-architecture approach for implicit computational fluid dynamics on unstructured grids”, AIAA Scitech 2023 Forum (23–27 January 2023, National Harbor, MD & Online), 2023, AIAA 2023-1226.
Pasquariello V., Bunk Y., Eberhardt S., Pei-Huang H., Matheis J., Ugolotti M., Hickel S.. “GPU-accelerated simulations for eVTOL aerodynamic analysis”, AIAA Scitech 2023 Forum (23–27 January 2023, National Harbor, MD & Online), 2023, AIAA 2023-2107.
Regev T., Nestmann J., Garzuzi A., Greenblatt D., Frankel S.. “GPU-accelerated high-fidelity implicit large eddy simulations of coanda cylinder flow instabilities”, AIAA Scitech 2023 Forum (23–27 January 2023, National Harbor, MD & Online), 2023, AIAA 2023-0272.
Kakumani H. C. V., Chamarthi A. S., Hoffmann N., Frankel S. H.. “GPU-accelerated numerical study of temperature effects in choked under-expanded supersonic jets”, AIAA SCITECH 2023 Forum (23–27 January 2023, National Harbor, MD & Online), 2023, AIAA 2023-0976.
Sitaraman J., Jude D.. “Development of GPGPU capable multi-solver overset methods”, AIAA SCITECH 2023 Forum (23–27 January 2023, National Harbor, MD & Online), 2023, AIAA 2023-0042.
Mortazawy M., Rao M., Jilesen J., Work D., Shock R.. Early Stage Vehicle Aerodynamics Development using a GPU Based LBM CFD Solver, 2003, 7 pp.
Kummerländer A., Dorn M., Frank M., Krause M. J.. “Implicit propagation of directly addressed grids in lattice Boltzmann methods”, Concurrency and Computation: Practice and Experience, 35:8 (2023), e7509.
De Vanna F., Avanzi F., Cogo M., Sandrin S., Bettencourt M., Picano F., Benini E.. “URANOS: A GPU accelerated Navier-Stokes solver for compressible wall-bounded flows”, Computer Physics Communications, 2023, 108717, 18 pp.
Chandravamsi H., Chamarthi A. S., Hoffmann N., Frankel S. H.. “On the application of gradient based reconstruction for flow simulations on generalized curvilinear and dynamic mesh domains”, Computers & Fluids, 2023, 105859, 28 pp.
Mattson P., Cheng C., Coleman C., Diamos G., Micikevicius P., Patterson D., Tang H., Wei G.-Y., Bailis P., Bittorf V., Brooks D., Chen D., Dutta D., Gupta U., Hazelwood K., Hock A., Huang X., Ike A., Jia B., Kang D., Kanter D., Kumar N., Liao J., Ma G., Narayanan D., Oguntebi T., Pekhimenko G., Pentecost L., Reddi V. J., Robie T., St. John T., Tabaru T., Wu C.-J., Xu L., Yamazaki M., Young C., Zaharia M.. “MLPerf training benchmark”, Proceedings of Machine Learning and Systems. 2, MLSys 2020, eds. I. Dhillon, D. Papailiopoulos, V. Sze, 2020, pp. 336–349.
MLPerf Training v.2.1 Results, ML Commons URL https://mlcommons.org/en/training/.
MLPerf Training HPC v2.0 results, ML Commons URL https://mlcommons.org/en/training-hpc/.
Nvidia NVLink and NVLink Switch, Cloud & Data Center, Nvidia URL https://www.nvidia.com/en-us/data-center/nvlink/.
PRE-EOS 128 NODE DGX SuperPOD — Nvidia DGX H100, Xeon Platinum 8480C 56C 2GHZ, Nvidia H100 Tensor core GPUs, Nvidia ConnectX-7 NDR 400G Infiniband, Top500.org, 2023 URL https://www.top500.org/system/180133/.
Kestrel System Configuration, Alliance for Sustainable Energy, 2023, National Renewable Energy Laboratory Computing Systems URL https://www.nrel.gov/hpc/kestrel-system-configuration.html.
G242-P36 (rev. 100), GIGA-BYTE Technology, 2024, Products URL https://www.gigabyte.com/ru/Enterprise/GPU-Server/G242-P36-rev-100.
Nvidia Grace CPU Superchip Whitepaper, V1.1, Nvidia, 2024, 20 pp.
Nvidia Grace CPU Superchip, Nvidia, 2024, Datasheet, 3 pp.
Arm Neoverse V2 Core Technical Reference Manual URL https://developer.arm.com/documentation/102375/latest/ (Accessed 15.10.2023).
Nvidia GH200 Grace Hopper Superchip Architecture, V1.01, Nvidia, 2024, 39 pp.
H263-V11, rev. LAW1, Giga-Byte Technology, Products URL https://www.gigabyte.com/Enterprise/High-Density-Server/H263-V11-rev-LAW1.
Petty H., Goldwasser I., Desale P.. One Giant Superchip for LLMs, Recommenders, and GNNs: Introducing Nvidia GH200 NVL32, Nvidia developer, 2023 URL https://developer.nvidia.com/blog/one-giant-superchip-for-llms-recommenders-and-gnns-introducing-nvidia-gh200-nvl32/.
Nvidia BlueField-3 networking platform, Nvidia, 2023, Datasheet, 2 pp.
CUDA Python Manual, v. 12.4.0, Nvidia, 2024 URL https://nvidia.github.io/cuda-python.
Nvidia NVVM IR Specification, V. 12.4, Nvidia, 2024, 80 pp.
Choquette J.. “Nvidia Hopper GPU: scaling performance”, 2022 IEEE Hot Chips 34 Symposium (HCS) (21–23 August 2022, Cupertino, CA, USA), IEEE, 2022, ISBN 978-1-6654-6028-6, pp. 1–46.
Amber22: pmemd.cuda performance information, The Amber project, ed. Kollman P., 2023 URL https://ambermd.org/GPUPerformance.php.
MLPerf Training v.3.0 Results, MLCommons, 2024 URL https://mlcommons.org/en/training-normal-30/.
AMD Instinct™ MI100 Accelerators, AMD, 2020, Overview URL https://www.amd.com/en/products/accelerators/instinct/mi100.html.
AMD Instinct™ Accelerators, AMD, Products URL https://www.amd.com/en/products/accelerators/instinct.html.
MLPerf Inference Datacenter v.3.1 Results, MLCommons URL https://mlcommons.org/en/inference-datacenter-31/.
Smith A., James N.. “AMD Instinct™ MI200 series accelerator and node architectures”, 2022 IEEE Hot Chips 34 Symposium (HCS) (21–23 August 2022, Cupertino, CA, USA), IEEE, 2022, ISBN 978-1-6654-6028-6, 23 pp.
AMD Instinct™ MI210 Accelerators, AMD, 2022 URL https://www.amd.com/en/products/accelerators/instinct/mi200/mi210.html.
AMD Instinct™ MI250 drivers & support URL https://www.amd.com/en/support/server-accelerators/amd-instinct/amd-instinct-mi-series/amd-instinct-mi250 (Accessed 15.10.2023).
AMD Instinct™ MI250X drivers & support URL https://www.amd.com/en/support/server-accelerators/amd-instinct/amd-instinct-mi-series/amd-instinct-mi250x (Accessed 15.10.2023).
Frontier User Guide, Oak Ridge National Laboratory, 2024 URL https://docs.olcf.ornl.gov/systems/Frontier_user_guide.html.
Introducing AMD CDNA architecture, AMD, 2020, 11 pp.
Introducing AMD CDNA 2 Architecture, AMD Instinct MI200, Advanced Micro Devices, 2021, White paper, 17 pp.
“AMD Instinct MI200” instruction set architecture, AMD Instinct MI200, Advanced Micro Devices, 2021, Reference Guide, 275 pp.
Sitaraman C., Chalmers N., Malaya N., McDougal D., O'Reilly O., van Oostrum R.. AMD matrix cores, AMD Labs notes, ed. Greathouse R., Advanced Micro Devices, 2023, GPU open URL https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/.
Pearson C.. Interconnect bandwidth heterogeneity on AMD MI250x and Infinity fabric, 2023.
Gates M., YarKhan A., Sukkari D., Akbudak K., Cayrols S., Bielich D., Abdelfattah A., Farhan M. A., Dongarra J.. “Portable and efficient dense linear algebra in the beginning of the exascale era”, 2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC) (13–18 November 2022, Dallas, TX, USA), IEEE, 2022, ISBN 978-1-6654-6021-7, pp. 36–46.
Melesse Vergara V. G., Budiardja R. D., Davis M. J., Ezell M. A., Hanley J. A., Zimmer C. J., Brim M. J., Elwasif W. R., Dietz D. T.. “Approaching the final Frontier: lessons learned from the deployment of HPE/Cray EX Spock and Crusher supercomputers”, Cray User Group 2022 Proceedings, CUG (May 2, 2022 – May 5, 2022), Oak Ridge National Lab. (ORNL), 2022 URL https://cug.org/proceedings/cug2022_proceedings/at_a_glance.html.
AMD Instinct Accelerator Qualified Servers Q4 2022, AMD Instinct MI200, Advanced Micro Devices, 2022, Reference Guide, 3 pp.
Welcome to AMD ROCm Platform, Revision e2b73a17, Advanced Micro Devices, 2021 URL https://cgmb-rocm-docs.readthedocs.io/en/latest.
AMD ROCm documentation, V. 6.1.1, 2024 URL https://rocmdocs.amd.com/en/latest/.
Kondratyuk N., Nikolskiy V., Pavlov D., Stegailov V.. “GPU-accelerated molecular dynamics: State-of-art software performance and porting from Nvidia CUDA to AMD HIP”, The International Journal of High Performance Computing Applications, 35:4 (2021), pp. 312–324.
Charrier D. et al.. GPUFORT: S2S translation tool for CUDA Fortran and Fortran+X in the spirit of hipify, AMD ROCm Software, GitHub Inc. URL https://github.com/ROCmSoftwarePlatform/gpufort#readme.
AMD ROCm documentation, AMD, 2024 URL https://rocm.docs.amd.com/en/latest/.
Khorassani K. S., Chen-Chen C., Ramesh B., Shafi A., Subramoni H., Panda D. K.. “High Performance MPI over the Slingshot Interconnect”, Journal of Computer Science and Technology, 38:1 (2023), pp. 128–145.
Welcome to the LUMI supercomputer user guide, LUMI (Large Unified Modern Infrastructure) consortium URL https://docs.lumi-supercomputer.eu.
Crusher Quick-Start Guide, Oak Ridge National Laboratory, 2024 URL https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html.
Spock Quick-Start Guide, Oak Ridge National Laboratory, 2023 URL https://docs.olcf.ornl.gov/systems/spock_quick_start_guide.html.
ThetaGPU Machine Overview, Argonne National Laboratory URL https://docs.alcf.anl.gov/theta-gpu/hardware-overview/theta-gpu-machine-overview/. Accessed 15.10.2023.
Polaris Machine Overview, Argonne National Laboratory, 2023 URL https://docs.alcf.anl.gov/polaris/hardware-overview/machine-overview/.
Summit User Guide, Oak Ridge National Laboratory, 2023, Alpine URL https://docs.olcf.ornl.gov/systems/summit_user_guide.html.
Heroux M. A. et al.. ECP software technology capability assessment report, V3.0, Oak Ridge National Lab, 2022, No. ORNL/TM-2022/2651, 237 pp.
Sathyanarayana S., Bernardini M., Modesti D., Pirozzoli S., Salvadore F.. High-speed turbulent flows towards the exascale: STREAmS-2 porting and performance, 2023, 32 pp.
Müller A., Schmidt B., Membarth R., Leißa R., Hack S.. AnySeq/GPU: A novel approach for faster sequence alignment on GPUs, 2022, 11 pp.
Manathunga M., Aktulga H. M., Götz A. W., Merz K. M. jr.. “Quantum mechanics/molecular mechanics simulations on Nvidia and AMD Graphics Processing Units”, Journal of Chemical Information and Modeling, 63:3 (2023), pp. 711–717.
Kolev T., Fischer P., Austin A. P., Barker A. T., Beams N., Brown J., Jean-Camier S., Chalmers N., Dobrev V., Dudouit Y., Ghaffary L., Kerkemeier S., Yu-Lan H., Merzari E., Min M., Pazner W., Rathnayake T., Shephard M. S., Siboni M. H., Smith C. W., Thompson J. L., Tomov S., Warburton T.. High-order algorithmic developments and optimizations for large-scale GPU-accelerated simulations, 2021, ECP Milestone Report: WBS 2.2.6.06, Milestone CEED-MS36, 51 pp.
Thavappiragasam M., Elwasif W., Sedova A.. Portability for GPU-accelerated molecular docking applications for cloud and HPC: can portable compiler directives provide performance across all platforms?, 2022, 10 pp.
Stone C. P., Walden A., Zubair M., Nielsen J.. “Accelerating unstructured-grid CFD algorithms on Nvidia and AMD GPUs”, 2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3) (15 November 2021, St. Louis, MO, USA), 2021, ISBN 978-1-6654-1126-4, pp. 19–26.
Pumma S., Vishnu A.. “Semantic-aware lossless data compression for Deep Learning Recommendation Model (DLRM)”, 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) (15 November 2021, St. Louis, MO, USA), IEEE, 2021, ISBN 978-1-6654-1124-0, pp. 1–8.
Han F., Kumar N.. HPC Application Performance on Dell PowerEdge R750xa Servers with the AMD Instinct MI210 Accelerator, Dell, 2022 URL https://infohub.delltechnologies.com/p/hpc-application-performance-on-dell-poweredge-r750xa-servers-with-the-amd-instinct-tm-mi210-accelerator/.
AMD Instinct MI200 Series Accelerators Benchmarks, AMD, 2024 URL https://www.amd.com/en/graphics/server-accelerators-benchmarks.
Yu Y., Cai C., Wang J., Bo Z., Zhu Z., Zheng H.. “Uni-Dock: GPU-accelerated docking enables ultralarge virtual screening”, Journal of Chemical Theory and Computation, 19:11 (2023), pp. 3336–3345.
Hao Y., Zhao X., Bao B., Berard D., Constable W., Aziz A., Liu X.. TorchBench: benchmarking PyTorch with High API surface coverage, 2023, 13 pp.
Punniyamurthy K., Beckmann B. M., Hamidouche K.. Optimizing distributed ML communication with fused computation-collective operations, 2023, 12 pp.
Guo Y., Lu L., Zhu S.. “Novel accelerated methods for convolution neural network with matrix core”, The Journal of Supercomputing, 79:17 (2023), pp. 19547–19573.
Eassa A., Porter C.. Fueling high-performance computing with full-stack innovation, Nvidia developer, 2022 URL https://developer.nvidia.com/blog/fueling-high-performance-computing-with-full-stack-innovation/.
Driving the Industry into the Exascale Era with AMD Instinct Accelerators, AMD Community, AMD, 2022 URL https://community.amd.com/t5/instinct-accelerators/driving-the-industry-into-the-exascale-era-with-amd-instinct/ba-p/539115.
Budiardja R. D., Berrill M., Eisenbach M., Jansen G. R., Joubert W., Nichols S., Rogers D. M., Tharrington A., Messer O. E. B.. “Ready for the Frontier: preparing applications for the world's first exascale system”, High Performance Computing, ISC High Performance 2023, LNCS, 13948, Springer, Cham, 2023, ISBN 978-3-031-32040-8, pp. 182–201.
Wittwer F., Sauter N. K., Mendez D., Poon B. K., Brewster A. S., Holton J. M., Wall M. E., Hart W. E., Bard D. J., Blaschke J. P.. Accelerating X-ray tracing for exascale systems using Kokkos, 2022, 6 pp.
Papatheodore T.. Frontier/Crusher node performance, Frontier Training Workshop (February 16, 2023), Oak Ridge National Laboratory, 13 pp.
MVAPICH: MPI over InfiniBand, Omni-Path, Ethernet/iWARP, RoCE, and Slingshot, NBCL, NOWLAB: Network Based Computing Lab URL https://mvapich.cse.ohio-state.edu/benchmarks/.
Kurzak J., Malaya N., Klemm M., Hiew E.. Matrix Multiply Stress Test, GitHub Inc., 2023 URL https://github.com/AMD-HPC/CoralGemm.
Papatheodore T.. GPU XGEMM, 2023, Benchmark URL https://github.com/tom-papatheodore/gpu_xgemm.
Enhancing LAMMPS simulations with AMD instinct accelerators: unleashing performance and scalability, High Performance Computing, AMD, 2023, Solution Brief, 4 pp.
HPC Comes to Life with AMD Instinct GPUs and NAMD, High Performance Computing, AMD, 2022, Solution Brief, 4 pp.
Pall S., Alekseenko A.. “GROMACS 2023: Readiness on the AMD GPU Heterogeneous Platform”, PDC Newsletters, 2023, no. 1.
Molecular Dynamics. Nvidia GPU Benchmarks AMBER 22, Exxact, 2023 URL https://www.exxactcorp.com/blog/Molecular-Dynamics/RTX3090-Benchmarks-for-HPC-AMBER22-A100-vs-RTX3080-vs-RTX3070-vs-RTX6000.
Zeng J., Zhang D., Lu D., Mo P., Li Z., Chen Y., Rynik M., Huang L., Li Z., Shi S., Wang Y., Ye H., Tuo P., Yang J., Ding Y., Li Y., Tisi D., Zeng Q., Bao H., Xia Y., Huang J., Muraoka K., Wang Y., Chang J., Yuan F., Bore S. L., Cai C., Lin Y., Wang B., Xu J., Zhu J.-X., Luo C., Zhang Y., Goodall R. E. A., Liang W., Singh A. K., Yao S., Zhang J., Wentzcovitch R., Han J., Liu J., Jia W., York D. M., E W., Car R., Zhang L., Wang H.. “DeePMD-kit v2: A software package for deep potential models”, The Journal of Chemical Physics, 159:5 (2023), 054801.
Prokopenko A., Sao P., Lebrun-Grandie D.. “A single-tree algorithm to compute the Euclidean minimum spanning tree on GPUs”, ICPP '22: Proceedings of the 51st International Conference on Parallel Processing (29 August 2022–1 September 2022, Bordeaux, France), ACM, New York, 2022, ISBN 978-1-4503-9733-9, 10 pp.
Bagusetty A., Panyala A., Brown G., Kirk J.. “Towards cross-platform portability of coupled-cluster methods with perturbative triples using SYCL”, 2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC) (13–18 November 2022, Dallas, TX, USA), IEEE, 2022, ISBN 978-1-6654-6021-7, pp. 81–88.
Bussy A., Schütt O., Hutter J.. “Sparse tensor based nuclear gradients for periodic Hartree–Fock and low-scaling correlated wave function methods in the CP2K software package: A massively parallel and GPU accelerated implementation”, The Journal of Chemical Physics, 158:16 (Apr 28 2023), 164109.
Mazur L., Bollweg D., Clarke D. A., Altenkort L., Kaczmarek O., Larsen R., Hai-Shu T., Goswami J., Scior P., Sandmeyer H., Neumann M., Dick H., Ali S., Kim J., Schmidt C., Petreczky P., Mukherjee S.. “SIMULATeQCD: A simple multi-GPU lattice code for QCD calculations”, Computer Physics Communications, 300 (July 2024), 109164.
Gottlieb S., Jeong H., Strelchenko A.. Two-link staggered quark smearing in QUDA, 2023, 10 pp.
Mullowney P., Thomas S., Carr A. K., Swirydowicz K., Day M., Esclapez L.. Novel solver algorithms for nearly singular linear systems arising in combustion modelling, 2022 SIAM Conference on Parallel Processing for Scientific Computing (February 23, 2022), National Renewable Energy Lab. (NREL), 2022 URL https://www.nrel.gov/docs/fy22osti/81907.pdf.
Halver R., Junghans C., Sutmann G.. “Using heterogeneous GPU nodes with a Cabana-based implementation of MPCD”, Parallel Computing, 117 (September 2023), 103033.
Min M., Brazell M., Tomboulides A., Churchfield M., Fischer P., Sprague M.. Towards exascale for wind energy simulations, 2022, 16 pp.
Kolev T., Fischer P., Abdelfattah A., Beams N., Brown J., Jean-Camier S., Carson R., Chalmers N., Dobrev V., Dudouit Y., Ghaffari L., Joshi A. Y., Kerkemeier S., Lan Y.-H., Mc Dougall D., Medina D., Min M., Mishra A., Pazner W., Phillips M., Ratnayaka T., Shephard M. S., Siboni M. H., Smith C. W., Thompson J. L., Tomboulides A., Tomov S., Tomov V., Warburton T.. High-order algorithmic developments and optimizations for more robust exascale applications, 2022, ECP Milestone Report. WBS 2.2.6.06, Milestone CEED-MS38, 76 pp.
Lesur G. R. J., Baghdadi S., Wafflard-Ernandez G., Mauxion J., Robert C. M. T., Van den Bossche M.. “IDEFIX: a versatile performance-portable Godunov code for astrophysical flows”, Astronomy and Astrophysics, 677 (September 2023), A9, 17 pp.
White C. J., Mullen P. D., Yan-Jiang F., Davis S. W., Stone J. M., Morozova V., Zhang L.. An Extension of the Athena++ code framework for radiation-magnetohydrodynamics in general relativity using a finite-solid-angle discretization, vol. 949, 2023, 29 pp.
Grete P., Dolence J. C., Miller J. M., Brown J., Ryan B., Gaspar A., Glines F., Swaminarayan S., Lippuner J., Solomon C. J., Shipman G., Junghans C., Holladay D., Stone J. M., Roberts L. F.. “Parthenon—a performance portable block-structured adaptive mesh refinement framework”, The International Journal of High Performance Computing Applications, 37:5 (2023), pp. 465–486.
Schild N., Räth M., Eibl S., Hallatschek K., Kormann K.. “A performance portable implementation of the semi-Lagrangian algorithm in six dimensions”, Computer Physics Communications, 295 (February 2024), 108973.
Sfiligoi I., Belli E. A., Candy J., Budiardja R. D.. “Optimization and Portability of a Fusion OpenACC-based Fortran HPC code from Nvidia to AMD GPUs”, PEARC '23: Practice and Experience in Advanced Research Computing (Portland, OR, USA, July 23–27, 2023), ACM, New York, July 2023, pp. 246–250.
Diederichs S., Benedetti C., Huebl A., Lehe R., Myers A., Sinn A., Vay J.-L., Zhang W., Thévenet M.. “HiPACE++: a portable, 3D quasi-static particle-in-cell code”, Computer Physics Communications, 278 (September 2022), 108421.
Breaking Barriers in Plasma Physics with PIConGPU and AMD Instinct MI250 GPU, High Performance Computing, AMD, 2023, 4 pp.
Fedeli L., Huebl A., Boillod-Cerneux F., Clark T., Gott K., Hillairet C., Jaure S., Leblanc A., Lehe R., Myers A., Piechurski C., Sato M., Zaim N., Zhang W., Vay J.-L., Vincenti H.. “Pushing the Frontier in the design of laser-based electron accelerators with groundbreaking mesh-refined particle-in-cell simulations on exascale-class supercomputers”, 2022 SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (Dallas, Texas, USA, November 13–18, 2022), IEEE, 2022, 12 pp.
Huebl A., Lehe R., Zoni E., Shapoval O., Sandberg R. T., Garten M., Formenti A., Jambunathan R., Kumar P., Gott K., Myers A., Zhang W., Almgren A., Mitchell C. E., Qiang J., Grote D., Sinn A., Diederichs S., Thevenet M., Fedeli L., Clark T., Zaim N., Vincenti H., Vay J.-L.. From compact plasma particle sources to advanced accelerators with modeling at exascale, 2023, 4 pp.
Adams M. F., Wang P., Merson J., Huck K., Knepley M. G.. A performance portable, fully implicit Landau collision operator with batched linear solvers, 2024, 20 pp.
Thawakar O., Anwer R. M., Laaksonen J., Reiner O., Shah M., Khan F. S.. 3D mitochondria instance segmentation with spatio-temporal transformers, 2023, 10 pp.
Samuel D., Kutuzov A., Touileb S., Velldal E., Øvrelid L., Rønningstad E., Sigdel E., Palatkina A.. NorBench–A benchmark for Norwegian language models, 2023, 16 pp.
Yankovskaya L., Tars M., Tättar A., Fishel M.. “Machine translation for low-resource Finno-Ugric languages”, Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) (Tórshavn, Faroe Islands, May 22–24, 2023), University of Tartu Library, 2023, ISBN 978-99-1621-999-7, pp. 762–771.
Charpentier L., Wold S., Samuel D., Rønningstad E.. “BRENT: Bidirectional retrieval enhanced Norwegian transformer”, Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) (Tórshavn, Faroe Islands, May 22–24, 2023), University of Tartu Library, 2023, ISBN 978-99-1621-999-7, pp. 202–214.
Samuel D., Øvrelid L.. “Tokenization with factorized subword encoding”, Findings of the Association for Computational Linguistics: ACL 2023 (Toronto, Canada, July 9–14, 2023), ACL, 2023, ISBN 9781959429623, pp. 14143–14161.
AMD Radeon Instinct MI300, GPU Specs Database, TechPowerUp URL https://www.techpowerup.com/gpu-specs/radeon-instinct-mi300.c4019.
Naffziger S., Beck N., Burd T., Lepak K., Loh G. H., Subramony M., White S.. “Pioneering chiplet technology and design for the AMD EPYC™ and Ryzen™ processor families: Industrial product”, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) (Valencia, Spain, 14–18 June 2021), IEEE, 2021, ISBN 978-1-4503-9086-6, pp. 57–70.
