
DOI: 10.14489/vkit.2025.07.pp.015-022



Yachnaya V. O.
BALANCING A VISUAL DATASET TO ESTIMATE THE NUMBER OF PHONEMES IN A WORD BY ARTICULATION
(pp. 15–22)

Abstract. Automatic visual recognition of communication components plays an important role in a number of modern applications, such as human-computer interaction, accessibility tools for individuals with hearing impairments, and advanced communication analysis systems. This technology enables detailed evaluation of the verbal, non-verbal, and paraverbal components of communication. In particular, determining the number of phonemes from articulation can supply systems with subtle paraverbal cues, improving both human and machine interpretation of communication. Previous work on visually determining the number of minimal linguistic units in speech ran into significant challenges, above all insufficient and imbalanced training data: the number of examples per class could differ by a factor of several tens. Such an imbalance can bias machine learning models, leading to skewed predictions and reduced overall accuracy, particularly on underrepresented classes, so data balancing is essential for accurate and reliable results. The present study therefore evaluates data balancing techniques for the task of counting the number of phonemes in English words from video. The model under consideration is based on ResNet-18 with 3D convolutions. The study examines loss-function weighting as well as undersampling and oversampling. Undersampling is performed both by removing video fragments at random within each class and by removing all fragments of specific words, i.e., by shrinking the training vocabulary. Oversampling is carried out with classical video augmentation methods, specifically via the VidAug library, and by cross-dataset augmentation, i.e., by adding video fragments from related datasets. Experimental results show that these balancing methods have a positive effect on training, improving the model's ability to generalize across classes and raising its accuracy to a certain extent; however, they are not decisive for the overall effectiveness of the current baseline model. The study highlights the need for further exploration of advanced balancing techniques and alternative model architectures to address class imbalance comprehensively.
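
The balancing methods named above can be made concrete with a short sketch. The snippet below caps each class at a fixed number of clips (undersampling by random removal) and derives inverse-frequency class weights for a weighted cross-entropy loss. It is a minimal illustration assuming PyTorch; the clip paths, class labels, and cap value are hypothetical, not the configuration used in the paper.

    import random
    from collections import defaultdict

    import torch
    import torch.nn as nn

    # Hypothetical inventory: phoneme count (class label) -> clip paths.
    clips_by_class = defaultdict(list)
    for path, n_phonemes in [("clip_0001.mp4", 3), ("clip_0002.mp4", 5)]:  # illustrative
        clips_by_class[n_phonemes].append(path)

    # Undersampling: randomly drop clips so that no class exceeds a cap.
    CAP = 500  # illustrative cap, not the paper's value
    balanced = {
        label: random.sample(paths, min(len(paths), CAP))
        for label, paths in clips_by_class.items()
    }

    # Loss weighting: inverse-frequency weights for the residual imbalance.
    labels = sorted(balanced)
    counts = torch.tensor([len(balanced[c]) for c in labels], dtype=torch.float)
    weights = counts.sum() / (len(counts) * counts)  # rarer classes weigh more
    criterion = nn.CrossEntropyLoss(weight=weights)

For example, with two classes of 500 and 50 clips, the weights come out to 0.55 and 5.5, so errors on the rare class contribute ten times more to the loss.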

Keywords: speech recognition; computer vision; data balancing; video augmentation; neural networks.
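
Oversampling with classical video augmentation is performed via the VidAug library (reference [22] below). The snippet shows how such a pipeline is typically assembled with VidAug; the particular operators, probabilities, and frame size are illustrative assumptions, not the transforms reported in the paper.

    import numpy as np
    from PIL import Image
    from vidaug import augmentors as va  # https://github.com/okankop/vidaug

    # A dummy 16-frame clip standing in for a real mouth-region crop.
    video = [Image.fromarray(np.random.randint(0, 255, (88, 88, 3), dtype=np.uint8))
             for _ in range(16)]

    sometimes = lambda aug: va.Sometimes(0.5, aug)  # apply aug to ~half the clips

    # Mild geometric jitter that keeps the mouth region legible for articulation.
    seq = va.Sequential([
        sometimes(va.HorizontalFlip()),
        sometimes(va.RandomRotate(degrees=10)),
        sometimes(va.RandomTranslate(x=10, y=10)),
    ])

    augmented_video = seq(video)  # a new variant to add to an underrepresented class

Each pass over a minority-class clip yields a slightly different variant, so an underrepresented class can be grown toward the majority size without collecting new video.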


V. O. Yachnaya (Pavlov Institute of Physiology, Russian Academy of Sciences; Saint Petersburg State University of Aerospace Instrumentation, Saint Petersburg, Russia)


References

1. Yachnaya, V. O., Lutsiv, V. R., & Malashin, R. O. (2023). Modern technologies for automatic recognition of communication means based on visual data. Kompiuternaia optika, 47(2), 287–305. [in Russian language]. https://doi.org/10.18287/2412-6179-CO-1154
2. Yachnaya, V. O., & Lutsiv, V. R. (2024). Automatic determination of the number of minimal language units by articulation. Kompiuternaia optika, 48(6), 956–962. [in Russian language]. https://doi.org/10.18287/2412-6179-CO-1451
3. The Oxford-BBC Lip Reading in the Wild (LRW) Dataset. (n.d.). Retrieved October 18, 2024, from https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html
4. Feng, D., Yang, S., & Shan, S. (2021). An efficient software for building lip reading models without pains. In 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW) (pp. 1–2). https://doi.org/10.1109/ICMEW53276.2021.9456014
5. Köchling, A., Riazy, S., Wehner, M. C., et al. (2021). Highly accurate, but still discriminatory. Business & Information Systems Engineering, 63, 39–54. https://doi.org/10.1007/s12599-020-00673-w
6. Rangulov, D., & Fahim, M. (2020). Emotion recognition on large video dataset based on convolutional feature extractor and recurrent neural network. In 2020 4th International Conference on Image Processing, Applications and Systems (IPAS) (pp. 14–20). https://doi.org/10.1109/IPAS50080.2020.9334935
7. Liu, Y., & You, X. (2019). Specific action recognition method based on unbalanced dataset. In 2019 2nd International Conference on Information Communication and Signal Processing (ICICSP) (pp. 454–458). https://doi.org/10.1109/ICICSP48821.2019.8958568
8. Bianculli, M., Falcionelli, N., Sernani, P., et al. (2020). A dataset for automatic violence detection in videos. Data in Brief, 33, Article 106587. https://doi.org/10.1016/j.dib.2020.106587
9. Kirichenko, L., Radivilova, T., Sydorenko, B., & Yakovlev, S. (2022). Detection of shoplifting on video using a hybrid network. Computation, 10(11), 199. https://doi.org/10.3390/computation10110199
10. Bano, S., Casella, A., Vasconcelos, F., et al. (2021). FetReg: Placental vessel segmentation and registration in fetoscopy challenge dataset, 1–17.
11. Paraschiv, M., Padrino, R., Casari, P., et al. (2022). Classification of underwater fish images and videos via very small convolutional neural networks. Journal of Marine Science and Engineering, 10(6), 736. https://doi.org/10.3390/jmse10060736
12. Li, Y., Yang, X., Shang, X., & Chua, T.-S. (2021). Interventional video relation detection. In Proceedings of the 29th ACM International Conference on Multimedia (pp. 4091–4099). https://doi.org/10.1145/3474085.3475540
13. Paulin, G., & Ivasic-Kos, M. (2023). Review and analysis of synthetic dataset generation methods and techniques for application in computer vision. Artificial Intelligence Review, 56, 9221–9265. https://doi.org/10.1007/s10462-022-10358-3
14. Paproki, A., Salvado, O., & Fookes, C. (2024). Synthetic data for deep learning in computer vision & medical imaging: A means to reduce data bias. ACM Computing Surveys, 56(11), 1–37. https://doi.org/10.1145/3663759
15. Presotto, R., Ek, S., Civitarese, G., Portet, F., Lalanda, P., & Bettini, C. (2023). Combining public human activity recognition datasets to mitigate labeled data scarcity. IEEE SMARTCOM, 1–13. https://doi.org/10.48550/arXiv.2306.13735
16. O'Gara, S., & McGuinness, K. (2019). Comparing data augmentation strategies for deep image classification. IMVIP 2019: Irish Machine Vision & Image Processing, 1–9. https://doi.org/10.21427/148b-ar75
17. Yachnaya, V. O., & Mikhalkova, M. A. (2024). Methods of increasing training data for a 3D neural network for Alzheimer's disease diagnosis. In 2024 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF) (pp. 1–6). https://doi.org/10.1109/WECONF61770.2024.10564628
18. Mahmoodi, N., Shirazi, H., & Fakhredanesh, M. (2025). Automatically weighted focal loss for imbalance learning. Neural Computing & Applications, 37, 4035–4052. https://doi.org/10.1007/s00521-024-10323-x
19. Perez-Martin, J., Bustos, B., & Pérez, J. (2021). Attentive visual semantic specialized network for video captioning. In 2020 25th International Conference on Pattern Recognition (ICPR) (pp. 5767–5774). https://doi.org/10.1109/ICPR48806.2021.9412898
20. Afouras, T., Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2018). Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2018.2889052
21. Yachnaya, V. O. (2024). Adaptation of the LRS2 video database for a word recognition system by articulation. In XVI International Conference "Applied Optics – 2024" (December 17–18, 2024), Saint Petersburg, Russia. [in Russian language]
22. Video Augmentation Techniques for Deep Learning. (n.d.). Retrieved October 18, 2024, from https://github.com/okankop/vidaug
23. Afouras, T., Chung, J. S., & Zisserman, A. (2018). Deep lip reading: a comparison of models and an online application. https://doi.org/10.48550/arXiv.1806.06053


This article is available in electronic format (PDF).

The price of the article is 700 rubles (including 20 % VAT). Within a few days of placing an order, an invoice and a receipt for bank payment will be sent to the e-mail address you provide.

Once the payment has reached the publisher's account, the electronic version of the article will be sent to you by e-mail.

To order the article, copy its DOI:

10.14489/vkit.2025.07.pp.015-022

and fill out the form. By submitting the form, you consent to the processing of your personal data.

 
