Local methods with adaptivity via scaling
- Authors: Chezhegov S. (1, 2, 3), Skorik S. N. (2), Khachaturov N. (4), Shalagin D. (5), Avetisyan A. A. (6, 2), Takáč M. (7), Kholodov Y. A. (5), Beznosikov A. N. (2, 3, 6)
- Affiliations:
- (1) Moscow Institute of Physics and Technology (National Research University)
- (2) Ivannikov Institute for System Programming of the RAS
- (3) Sber AI Lab
- (4) Russian-Armenian University
- (5) Innopolis University
- (6) Russian Academy of National Economy and Public Administration under the President of the Russian Federation
- (7) Mohamed bin Zayed University of Artificial Intelligence
- Issue: Vol 79, No 6 (2024)
- Pages: 117-158
- Section: Articles
- URL: https://bakhtiniada.ru/0042-1316/article/view/281943
- DOI: https://doi.org/10.4213/rm10209
- ID: 281943
Abstract
The rapid development of machine learning and deep learning has introduced increasingly complex optimization challenges that must be addressed. Indeed, training modern, advanced models has become difficult without leveraging multiple computing nodes in a distributed environment. Distributed optimization is also fundamental to emerging fields such as federated learning. Specifically, the training process must be organized so as to minimize the time lost to communication. A widely used and extensively researched technique for mitigating the communication bottleneck is to perform local training before communication; this approach is the focus of our paper. Concurrently, adaptive methods that incorporate scaling, most notably Adam, have gained significant popularity in recent years. Therefore, this paper aims to merge the local training technique with the adaptive approach to develop efficient distributed learning methods. We consider the classical Local SGD method and enhance it with a scaling feature. A crucial aspect is that scaling is described generically, allowing us to analyze various approaches, including Adam, RMSProp, and OASIS, in a unified manner. In addition to the theoretical analysis, we validate the performance of our methods in practice by training a neural network. Bibliography: 49 titles.
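As a rough illustration of the idea described in the abstract, the sketch below combines Local SGD with a generic diagonal scaling of the local gradient steps; an RMSProp-style preconditioner is shown as one concrete instance. This is a minimal sketch, not the paper's exact algorithm: all function names, hyperparameters, and the toy quadratic objective are illustrative assumptions.

```python
import numpy as np

def local_sgd_scaled(grad, x0, n_workers=4, rounds=50, local_steps=10,
                     lr=0.1, beta=0.99, eps=1e-8, rng=None):
    """Local SGD where each worker scales its stochastic gradient by the
    inverse square root of a running average of squared gradients
    (RMSProp-style diagonal preconditioner); illustrative only."""
    rng = np.random.default_rng(rng)
    x = x0.copy()
    # per-worker second-moment accumulators for the diagonal scaling
    v = np.zeros((n_workers, x.size))
    for _ in range(rounds):                      # communication rounds
        local = np.tile(x, (n_workers, 1))       # broadcast current model
        for _ in range(local_steps):             # local training phase
            for m in range(n_workers):
                g = grad(local[m], rng)          # stochastic gradient on worker m
                v[m] = beta * v[m] + (1 - beta) * g * g
                scale = np.sqrt(v[m]) + eps      # diagonal preconditioner
                local[m] -= lr * g / scale       # scaled local step
        x = local.mean(axis=0)                   # communicate: average models
    return x

# toy usage: minimize f(x) = 0.5 * ||A x - b||^2 with noisy gradients
if __name__ == "__main__":
    d = 20
    A = np.diag(np.linspace(1.0, 10.0, d))
    b = np.ones(d)

    def grad(x, rng):
        return A.T @ (A @ x - b) + 0.01 * rng.standard_normal(d)

    x_final = local_sgd_scaled(grad, x0=np.zeros(d))
    print("final objective:", 0.5 * np.linalg.norm(A @ x_final - b) ** 2)
```

Averaging the local models only once per communication round is what keeps communication cheap; swapping the `scale` computation for a different diagonal estimate (Adam- or OASIS-style) would correspond to the other scalings mentioned in the abstract.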
About the authors
Savelii Chezhegov
Moscow Institute of Physics and Technology (National Research University); Ivannikov Institute for System Programming of the RAS; Sber AI Lab
Author for correspondence.
Email: chezhegov.sa@phystech.edu
Sergej Nikolaevich Skorik
Ivannikov Institute for System Programming of the RAS
Email: skorik@ispras.ru
ORCID iD: 0000-0002-8316-7302
Nikolas Khachaturov
Russian-Armenian University
Email: nickhach23@gmail.com
Danil Shalagin
Innopolis University
Email: d.shalagin@innopolis.university
Aram Arut'unovich Avetisyan
Russian Academy of National Economy and Public Administration under the President of the Russian Federation; Ivannikov Institute for System Programming of the RAS
Email: a.a.avetisyan@ispras.ru
ORCID iD: 0000-0002-7066-6954
Martin Takáč
Mohamed bin Zayed University of Artificial Intelligence
Email: martin.takac@mbzuai.ac.ae
Yaroslav Aleksandrovich Kholodov
Innopolis University
Email: ya.kholodov@innopolis.ru
ORCID iD: 0000-0003-2466-1594
SPIN-code: 8378-7971
Scopus Author ID: 6602420821
ResearcherId: K-7736-2013
Doctor of physico-mathematical sciences, Professor
Aleksandr Nikolaevich Beznosikov
Ivannikov Institute for System Programming of the RAS; Sber AI Lab; Russian Academy of National Economy and Public Administration under the President of the Russian Federation
Email: anbeznosikov@gmail.com
References
- S. Shalev-Shwartz, S. Ben-David, Understanding machine learning: from theory to algorithms, Russian transl., DMK Press, Moscow, 2019, 436 pp.
- I. Goodfellow, Y. Bengio, A. Courville, Deep learning, 2nd ed., Russian transl., DMK Press, Moscow, 2018, 652 pp.
- S. Arora, N. Cohen, E. Hazan, “On the optimization of deep networks: implicit acceleration by overparameterization”, Proceedings of the 35th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 80, 2018, 244–253
- V. Smith, S. Forte, Chenxin Ma, M. Takač, M. I. Jordan, M. Jaggi, “CoCoA: A general framework for communication-efficient distributed optimization”, J. Mach. Learn. Res., 18 (2018), 230, 49 pp.
- K. Mishchenko, E. Gorbunov, M. Takač, P. Richtarik, Distributed learning with compressed gradient differences, 2023 (v1 – 2019), 59 pp.
- J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, J. S. Rellermeyer, “A survey on distributed machine learning”, ACM Comput. Surveys, 53:2 (2020), 30, 33 pp.
- S. Chraibi, A. Khaled, D. Kovalev, P. Richtarik, A. Salim, M. Takač, Distributed fixed point methods with compressed iterates, 2019, 15 pp.
- V. Pirau, A. Beznosikov, M. Takač, V. Matyukhin, A. Gasnikov, “Preconditioning meets biased compression for efficient distributed optimization”, Comput. Manag. Sci., 21:1 (2024), 14, 22 pp.
- J. Konečny, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, D. Bacon, Federated learning: strategies for improving communication efficiency, 2017 (v1 – 2016), 10 pp.
- P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al., “Advances and open problems in federated learning”, Found. Trends Mach. Learn., 14:1-2 (2021), 1–210
- S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5132–5143
- Y. Arjevani, O. Shamir, “Communication complexity of distributed convex learning and optimization”, NIPS '15: Proceedings of the 28th international conference on neural information processing systems, v. 1, Adv. Neural Inf. Process. Syst., 28, MIT Press, Cambridge, MA, 2015, 1756–1764, https://papers.nips.cc/paper_files/paper/2015/hash/7fec306d1e665bc9c748b5d2b99a6e97-Abstract.html
- D. Alistarh, D. Grubic, J. Z. Li, R. Tomioka, M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding”, NIPS '17: Proceedings of the 31st international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 30, Curran Associates, Inc., Red Hook, NY, 2017, 1707–1718
- A. Beznosikov, S. Horvath, P. Richtarik, M. Safaryan, On biased compression for distributed learning, 2024 (v1 – 2020), 50 pp.
- O. L. Mangasarian, “Parallel gradient distribution in unconstrained optimization”, SIAM J. Control Optim., 33:6 (1995), 1916–1925
- S. U. Stich, Local SGD converges fast and communicates little, 2019 (v1 – 2018), 19 pp.
- B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data”, Proceedings of the 20th international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 54, 2017, 1273–1282
- Jianyu Wang, V. Tantia, N. Ballas, M. Rabbat, SlowMo: Improving communication-efficient distributed SGD with slow momentum, 2020 (v1 – 2019), 27 pp.
- D. Basu, D. Data, C. Karakus, S. N. Diggavi, “Qsparse-local-SGD: distributed SGD with quantization, sparsification, and local computations”, IEEE J. Sel. Areas Inform. Theory, 1:1 (2020), 217–226
- A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization”, Proceedings of the 23rd international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 108, 2020, 2021–2031
- Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, Yifei Cheng, Variance reduced local SGD with lower communication complexity, 2019, 25 pp.
- P. Sharma, S. Kafle, P. Khanduri, S. Bulusu, K. Rajawat, P. K. Varshney, Parallel restarted SPIDER–communication efficient distributed nonconvex optimization with optimal computation complexity, 2020 (v1 – 2019), 25 pp.
- K. Mishchenko, G. Malinovsky, S. Stich, P. Richtarik, “ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!”, Proceedings of the 39th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 162, 2022, 15750–15769
- O. Shamir, N. Srebro, Tong Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method”, Proceedings of the 31st international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 32, 2014, 1000–1008
- H. Hendrikx, Lin Xiao, S. Bubeck, F. Bach, L. Massoulie, “Statistically preconditioned accelerated gradient method for distributed optimization”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 4203–4227
- A. Beznosikov, G. Scutari, A. Rogozin, A. Gasnikov, “Distributed saddle-point problems under similarity”, NIPS '21: Proceedings of the 35th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 34, Curran Associates, Inc., Red Hook, NY, 2021, 625, 8172–8184
- D. Kovalev, A. Beznosikov, E. Borodich, A. Gasnikov, G. Scutari, Optimal gradient sliding and its application to distributed optimization under similarity, 2022, 24 pp.
- D. P. Kingma, J. L. Ba, Adam: a method for stochastic optimization, 2017 (v1 – 2014), 15 pp.
- T. Tieleman, G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude”, COURSERA: Neural networks for machine learning, 4:2 (2012), 26–31
- J. Duchi, E. Hazan, Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization”, J. Mach. Learn. Res., 12 (2011), 2121–2159
- C. Bekas, E. Kokiopoulou, Y. Saad, “An estimator for the diagonal of a matrix”, Appl. Numer. Math., 57:11-12 (2007), 1214–1229
- A. Sadiev, A. Beznosikov, A. J. Almansoori, D. Kamzolov, R. Tappenden, M. Takač, “Stochastic gradient methods with preconditioned updates”, J. Optim. Theory Appl., 201:2 (2024), 471–489
- M. Jahani, S. Rusakov, Zheng Shi, P. Richtarik, M. W. Mahoney, M. Takač, Doubly adaptive scaled algorithm for machine learning using second-order information, 2021, 44 pp.
- J. E. Dennis, Jr., J. J. More, “Quasi-Newton methods, motivation and theory”, SIAM Rev., 19:1 (1977), 46–89
- R. Fletcher, Practical methods of optimization, 2nd ed., Wiley-Intersci. [John Wiley & Sons], New York, 2013, xiv+436 pp.
- A. Khaled, K. Mishchenko, P. Richtarik, “Tighter theory for local SGD on identical and heterogeneous data”, Proceedings of the 23rd international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 108, 2020, 4519–4529
- A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, S. Stich, “A unified theory of decentralized SGD with changing topology and local updates”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5381–5393
- A. Beznosikov, P. Dvurechenskii, A. Koloskova, V. Samokhin, S. U. Stich, A. Gasnikov, “Decentralized local stochastic extra-gradient for variational inequalities”, NIPS '22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2762, 38116–38133
- M. R. Glasgow, Honglin Yuan, Tengyu Ma, “Sharp bounds for federated averaging (local SGD) and continuous perspective”, Proceedings of the 25th international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 151, 2022, 9050–9090
- A. Sadiev, A. Beznosikov, A. J. Almansoori, D. Kamzolov, R. Tappenden, M. Takač, Stochastic gradient methods with preconditioned updates, 2024 (v1 – 2022), 40 pp.
- A. Beznosikov, A. Alanov, D. Kovalev, M. Takač, A. Gasnikov, On scaled methods for saddle point problems, 2023 (v1 – 2022), 54 pp.
- S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečny, S. Kumar, H. B. McMahan, Adaptive federated optimization, 2021 (v1 – 2020), 38 pp.
- Y. Nesterov, Introductory lectures on convex optimization. A basic course, Appl. Optim., 87, Kluwer Acad. Publ., Boston, MA, 2004, xviii+236 pp.
- A. S. Nemirovskii, D. B. Yudin, Problem complexity and method efficiency in optimization, Nauka, Moscow, 1979, 384 pp. (Russian)
- Lam Nguyen, Phuong Ha Nguyen, M. Dijk, P. Richtarik, K. Scheinberg, M. Takac, “SGD and Hogwild! Convergence without the bounded gradients assumption”, Proceedings of the 35th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 80, 2018, 3750–3758
- Zhewei Yao, A. Gholami, Sheng Shen, M. Mustafa, K. Keutzer, M. Mahoney, “ADAHESSIAN: An adaptive second order optimizer for machine learning”, Proc. Int. AAAI Conf., 35:12 (2021), 10665–10673
- A. Defossez, L. Bottou, F. Bach, N. Usunier, A simple convergence proof of Adam and Adagrad, 2022 (v1 – 2020), 30 pp.
- A. Krizhevsky, Learning multiple layers of features from tiny images, Tech. rep., Univ. of Toronto, Toronto, ON, 2009, 60 pp.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep residual learning for image recognition”, Proceedings of the 2016 IEEE conference on computer vision and pattern recognition (Las Vegas, NV), IEEE, 2016, 770–778
