Local methods with adaptivity via scaling
- Authors: Chezhegov S. (1, 2, 3), Skorik S. N. (2), Khachaturov N. (4), Shalagin D. (5), Avetisyan A. A. (6, 2), Takáč M. (7), Kholodov Y. A. (5), Beznosikov A. N. (2, 3, 6)
- Affiliations:
- (1) Moscow Institute of Physics and Technology (National Research University)
- (2) Ivannikov Institute for System Programming of the RAS
- (3) Sber AI Lab
- (4) Russian-Armenian University
- (5) Innopolis University
- (6) Russian Academy of National Economy and Public Administration under the President of the Russian Federation
- (7) Mohamed bin Zayed University of Artificial Intelligence
- Issue: Vol 79, No 6 (2024)
- Pages: 117-158
- Section: Articles
- URL: https://bakhtiniada.ru/0042-1316/article/view/281943
- DOI: https://doi.org/10.4213/rm10209
- ID: 281943
Abstract
The rapid development of machine learning and deep learning has introduced increasingly complex optimization challenges that must be addressed. Indeed, training modern, advanced models has become difficult without leveraging multiple computing nodes in a distributed environment. Distributed optimization is also fundamental to emerging fields such as federated learning. Specifically, the training process must be organized so as to minimize the time lost to communication. A widely used and extensively researched technique for mitigating the communication bottleneck is to perform local training before communication; this approach is the focus of our paper. Concurrently, adaptive methods that incorporate scaling, most notably Adam, have gained significant popularity in recent years. Therefore, this paper aims to merge the local training technique with the adaptive approach to develop efficient distributed learning methods. We consider the classical Local SGD method and enhance it with a scaling feature. A crucial aspect is that scaling is described generically, allowing us to analyze various approaches, including Adam, RMSProp, and OASIS, in a unified manner. In addition to the theoretical analysis, we validate the performance of our methods in practice by training a neural network. Bibliography: 49 titles.
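As a rough illustration of the idea described in the abstract, the sketch below combines Local SGD with a generic diagonal scaling of the local gradient steps; an RMSProp-style preconditioner is shown as one concrete instance. This is a minimal sketch, not the paper's exact algorithm: all function names, hyperparameters, and the toy quadratic objective are illustrative assumptions.

```python
import numpy as np

def local_sgd_scaled(grad, x0, n_workers=4, rounds=50, local_steps=10,
                     lr=0.1, beta=0.99, eps=1e-8, rng=None):
    """Local SGD where each worker scales its stochastic gradient by the
    inverse square root of a running average of squared gradients
    (RMSProp-style diagonal preconditioner); illustrative only."""
    rng = np.random.default_rng(rng)
    x = x0.copy()
    # per-worker second-moment accumulators for the diagonal scaling
    v = np.zeros((n_workers, x.size))
    for _ in range(rounds):                      # communication rounds
        local = np.tile(x, (n_workers, 1))       # broadcast current model
        for _ in range(local_steps):             # local training phase
            for m in range(n_workers):
                g = grad(local[m], rng)          # stochastic gradient on worker m
                v[m] = beta * v[m] + (1 - beta) * g * g
                scale = np.sqrt(v[m]) + eps      # diagonal preconditioner
                local[m] -= lr * g / scale       # scaled local step
        x = local.mean(axis=0)                   # communicate: average models
    return x

# toy usage: minimize f(x) = 0.5 * ||A x - b||^2 with noisy gradients
if __name__ == "__main__":
    d = 20
    A = np.diag(np.linspace(1.0, 10.0, d))
    b = np.ones(d)

    def grad(x, rng):
        return A.T @ (A @ x - b) + 0.01 * rng.standard_normal(d)

    x_final = local_sgd_scaled(grad, x0=np.zeros(d))
    print("final objective:", 0.5 * np.linalg.norm(A @ x_final - b) ** 2)
```

Averaging the local models only once per communication round is what keeps communication cheap; swapping the `scale` computation for a different diagonal estimate (Adam- or OASIS-style) would correspond to the other scalings mentioned in the abstract.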
About the authors
Savelii Chezhegov
Moscow Institute of Physics and Technology (National Research University); Ivannikov Institute for System Programming of the RAS; Sber AI Lab
Author for correspondence.
Email: chezhegov.sa@phystech.edu
Sergej Nikolaevich Skorik
Ivannikov Institute for System Programming of the RAS
Email: skorik@ispras.ru
ORCID iD: 0000-0002-8316-7302
Nikolas Khachaturov
Russian-Armenian University
Email: nickhach23@gmail.com
Danil Shalagin
Innopolis University
Email: d.shalagin@innopolis.university
Aram Arut'unovich Avetisyan
Russian Academy of National Economy and Public Administration under the President of the Russian Federation; Ivannikov Institute for System Programming of the RAS
Email: a.a.avetisyan@ispras.ru
ORCID iD: 0000-0002-7066-6954
Martin Takáč
Mohamed bin Zayed University of Artificial Intelligence
Email: martin.takac@mbzuai.ac.ae
Yaroslav Aleksandrovich Kholodov
Innopolis University
Email: ya.kholodov@innopolis.ru
ORCID iD: 0000-0003-2466-1594
SPIN-code: 8378-7971
Scopus Author ID: 6602420821
ResearcherId: K-7736-2013
Doctor of physico-mathematical sciences, Professor
Aleksandr Nikolaevich Beznosikov
Ivannikov Institute for System Programming of the RAS; Sber AI Lab; Russian Academy of National Economy and Public Administration under the President of the Russian Federation
Email: anbeznosikov@gmail.com
References
- S. Shalev-Shwartz, S. Ben-David, Understanding machine learning: from theory to algorithms, Russian transl., DMK Press, Moscow, 2019, 436 pp.
- I. Goodfellow, Y. Bengio, A. Courville, Deep learning, 2nd ed., Russian transl., DMK Press, Moscow, 2018, 652 pp.
- S. Arora, N. Cohen, E. Hazan, “On the optimization of deep networks: implicit acceleration by overparameterization”, Proceedings of the 35th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 80, 2018, 244–253
- V. Smith, S. Forte, Chenxin Ma, M. Takač, M. I. Jordan, M. Jaggi, “CoCoA: A general framework for communication-efficient distributed optimization”, J. Mach. Learn. Res., 18 (2018), 230, 49 pp.
- K. Mishchenko, E. Gorbunov, M. Takač, P. Richtarik, Distributed learning with compressed gradient differences, 2023 (v1 – 2019), 59 pp.
- J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, J. S. Rellermeyer, “A survey on distributed machine learning”, ACM Comput. Surveys, 53:2 (2020), 30, 33 pp.
- S. Chraibi, A. Khaled, D. Kovalev, P. Richtarik, A. Salim, M. Takač, Distributed fixed point methods with compressed iterates, 2019, 15 pp.
- V. Pirau, A. Beznosikov, M. Takač, V. Matyukhin, A. Gasnikov, “Preconditioning meets biased compression for efficient distributed optimization”, Comput. Manag. Sci., 21:1 (2024), 14, 22 pp.
- J. Konečny, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, D. Bacon, Federated learning: strategies for improving communication efficiency, 2017 (v1 – 2016), 10 pp.
- P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al., “Advances and open problems in federated learning”, Found. Trends Mach. Learn., 14:1-2 (2021), 1–210
- S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5132–5143
- Y. Arjevani, O. Shamir, “Communication complexity of distributed convex learning and optimization”, NIPS '15: Proceedings of the 28th international conference on neural information processing systems, v. 1, Adv. Neural Inf. Process. Syst., 28, MIT Press, Cambridge, MA, 2015, 1756–1764, https://papers.nips.cc/paper_files/paper/2015/hash/7fec306d1e665bc9c748b5d2b99a6e97-Abstract.html
- D. Alistarh, D. Grubic, J. Z. Li, R. Tomioka, M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding”, NIPS '17: Proceedings of the 31st international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 30, Curran Associates, Inc., Red Hook, NY, 2017, 1707–1718
- A. Beznosikov, S. Horvath, P. Richtarik, M. Safaryan, On biased compression for distributed learning, 2024 (v1 – 2020), 50 pp.
- O. L. Mangasarian, “Parallel gradient distribution in unconstrained optimization”, SIAM J. Control Optim., 33:6 (1995), 1916–1925
- S. U. Stich, Local SGD converges fast and communicates little, 2019 (v1 – 2018), 19 pp.
- B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data”, Proceedings of the 20th international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 54, 2017, 1273–1282
- Jianyu Wang, V. Tantia, N. Ballas, M. Rabbat, SlowMo: Improving communication-efficient distributed SGD with slow momentum, 2020 (v1 – 2019), 27 pp.
- D. Basu, D. Data, C. Karakus, S. N. Diggavi, “Qsparse-local-SGD: distributed SGD with quantization, sparsification, and local computations”, IEEE J. Sel. Areas Inform. Theory, 1:1 (2020), 217–226
- A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization”, Proceedings of the 23rd international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 108, 2020, 2021–2031
- Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, Yifei Cheng, Variance reduced local SGD with lower communication complexity, 2019, 25 pp.
- P. Sharma, S. Kafle, P. Khanduri, S. Bulusu, K. Rajawat, P. K. Varshney, Parallel restarted SPIDER–communication efficient distributed nonconvex optimization with optimal computation complexity, 2020 (v1 – 2019), 25 pp.
- K. Mishchenko, G. Malinovsky, S. Stich, P. Richtarik, “ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!”, Proceedings of the 39th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 162, 2022, 15750–15769
- O. Shamir, N. Srebro, Tong Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method”, Proceedings of the 31st international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 32, 2014, 1000–1008
- H. Hendrikx, Lin Xiao, S. Bubeck, F. Bach, L. Massoulie, “Statistically preconditioned accelerated gradient method for distributed optimization”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 4203–4227
- A. Beznosikov, G. Scutari, A. Rogozin, A. Gasnikov, “Distributed saddle-point problems under similarity”, NIPS '21: Proceedings of the 35th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 34, Curran Associates, Inc., Red Hook, NY, 2021, 625, 8172–8184
- D. Kovalev, A. Beznosikov, E. Borodich, A. Gasnikov, G. Scutari, Optimal gradient sliding and its application to distributed optimization under similarity, 2022, 24 pp.
- D. P. Kingma, J. L. Ba, Adam: a method for stochastic optimization, 2017 (v1 – 2014), 15 pp.
- T. Tieleman, G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude”, COURSERA: Neural networks for machine learning, 4:2 (2012), 26–31
- J. Duchi, E. Hazan, Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization”, J. Mach. Learn. Res., 12 (2011), 2121–2159
- C. Bekas, E. Kokiopoulou, Y. Saad, “An estimator for the diagonal of a matrix”, Appl. Numer. Math., 57:11-12 (2007), 1214–1229
- A. Sadiev, A. Beznosikov, A. J. Almansoori, D. Kamzolov, R. Tappenden, M. Takač, “Stochastic gradient methods with preconditioned updates”, J. Optim. Theory Appl., 201:2 (2024), 471–489
- M. Jahani, S. Rusakov, Zheng Shi, P. Richtarik, M. W. Mahoney, M. Takač, Doubly adaptive scaled algorithm for machine learning using second-order information, 2021, 44 pp.
- J. E. Dennis, Jr., J. J. More, “Quasi-Newton methods, motivation and theory”, SIAM Rev., 19:1 (1977), 46–89
- R. Fletcher, Practical methods of optimization, 2nd ed., Wiley-Intersci. [John Wiley & Sons], New York, 2013, xiv+436 pp.
- A. Khaled, K. Mishchenko, P. Richtarik, “Tighter theory for local SGD on identical and heterogeneous data”, Proceedings of the 23rd international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 108, 2020, 4519–4529
- A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, S. Stich, “A unified theory of decentralized SGD with changing topology and local updates”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5381–5393
- A. Beznosikov, P. Dvurechenskii, A. Koloskova, V. Samokhin, S. U. Stich, A. Gasnikov, “Decentralized local stochastic extra-gradient for variational inequalities”, NIPS '22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2762, 38116–38133
- M. R. Glasgow, Honglin Yuan, Tengyu Ma, “Sharp bounds for federated averaging (local SGD) and continuous perspective”, Proceedings of the 25th international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 151, 2022, 9050–9090
- A. Sadiev, A. Beznosikov, A. J. Almansoori, D. Kamzolov, R. Tappenden, M. Takač, Stochastic gradient methods with preconditioned updates, 2024 (v1 – 2022), 40 pp.
- A. Beznosikov, A. Alanov, D. Kovalev, M. Takač, A. Gasnikov, On scaled methods for saddle point problems, 2023 (v1 – 2022), 54 pp.
- S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečny, S. Kumar, H. B. McMahan, Adaptive federated optimization, 2021 (v1 – 2020), 38 pp.
- Y. Nesterov, Introductory lectures on convex optimization. A basic course, Appl. Optim., 87, Kluwer Acad. Publ., Boston, MA, 2004, xviii+236 pp.
- A. S. Nemirovskii, D. B. Yudin, Problem complexity and method efficiency in optimization, Nauka, Moscow, 1979, 384 pp. (Russian)
- Lam Nguyen, Phuong Ha Nguyen, M. Dijk, P. Richtarik, K. Scheinberg, M. Takac, “SGD and Hogwild! Convergence without the bounded gradients assumption”, Proceedings of the 35th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 80, 2018, 3750–3758
- Zhewei Yao, A. Gholami, Sheng Shen, M. Mustafa, K. Keutzer, M. Mahoney, “ADAHESSIAN: An adaptive second order optimizer for machine learning”, Proc. Int. AAAI Conf., 35:12 (2021), 10665–10673
- A. Defossez, L. Bottou, F. Bach, N. Usunier, A simple convergence proof of Adam and Adagrad, 2022 (v1 – 2020), 30 pp.
- A. Krizhevsky, Learning multiple layers of features from tiny images, Tech. rep., Univ. of Toronto, Toronto, ON, 2009, 60 pp.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep residual learning for image recognition”, Proceedings of the 2016 IEEE conference on computer vision and pattern recognition (Las Vegas, NV), IEEE, 2016, 770–778
