Reducing the search space for optimal clustering parameters using a small amount of labeled data
- Authors: Yuferev V.I.1, Razin N.A.1
- Affiliation: The Central Bank of the Russian Federation
- Issue: No. 1 (2024)
- Pages: 103-117
- Section: Analysis of Textual and Graphical Information
- URL: https://bakhtiniada.ru/2071-8594/article/view/269794
- DOI: https://doi.org/10.14357/20718594240109
- EDN: https://elibrary.ru/WWPOLG
- ID: 269794
Abstract
The paper presents a method for reducing the search space for optimal clustering parameters. This is achieved by selecting the most suitable data transformation methods and dissimilarity measures before the clustering itself is performed. To compare the candidate methods, it is proposed to use the silhouette coefficient computed on a small labeled data set, with its class labels treated as cluster labels. Results of an experimental evaluation of the proposed approach on news text clustering are presented.
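As a rough illustration of the selection step described in the abstract, the following minimal sketch (not the authors' implementation; it assumes scikit-learn is available, and the names X_small, y_small and the candidate transformations are purely illustrative) scores several transformation/dissimilarity combinations by the silhouette coefficient on a small labeled subset, treating class labels as cluster labels, and keeps only the top-scoring combinations for the subsequent clustering search.

```python
# Minimal sketch (not the authors' code): rank candidate transformation /
# dissimilarity combinations by the silhouette coefficient computed on a
# small labeled subset, treating its class labels as cluster labels.
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import Normalizer, StandardScaler

# X_small: feature matrix of the labeled subset, y_small: its class labels.
# Both are assumed to be given; random data is used here only as a stand-in.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(60, 20))
y_small = rng.integers(0, 3, size=60)

transforms = {
    "raw": lambda X: X,
    "standardized": lambda X: StandardScaler().fit_transform(X),
    "l2-normalized": lambda X: Normalizer().fit_transform(X),
}
metrics = ["euclidean", "cosine", "manhattan"]

scores = {}
for t_name, transform in transforms.items():
    Xt = transform(X_small)
    for metric in metrics:
        # Class labels stand in for cluster labels when scoring.
        scores[(t_name, metric)] = silhouette_score(Xt, y_small, metric=metric)

# Keep only the best-scoring combinations for the full clustering search.
best = sorted(scores, key=scores.get, reverse=True)[:3]
print(best)
```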

About the authors
Vitaly Yuferev
The Central Bank of the Russian Federation
Corresponding author.
Email: YuferevVI@cbr.ru
Consultant, Innovative Laboratory “Novosibirsk”, Department of Information Technologies
Russia, Moscow
Nikolai Razin
The Central Bank of the Russian Federation
Email: RazinNA@cbr.ru
Candidate of Physical and Mathematical Sciences, Head of the Center of Competence in Artificial Intelligence and Advanced Analytics, Data Management Department
Russia, Moscow

Note
* This article reflects the personal position of the authors. The content and results of this study should not be regarded, or quoted in any publication, as the official position of the Bank of Russia or as an indication of the regulator's official policy or decisions. Any errors in this material are solely the responsibility of the authors.