Application of convolutional neural networks and deep learning algorithms for prediction and identification of voice deepfakes

Ponomarev K.G., Vereshchagina E.A.

Article received: 13.12.2024

This article aims to build a convolutional neural network model that identifies and predicts audio deepfakes by classifying voice content with deep learning algorithms and Python libraries. The audio datasets used to train the network are represented as mel spectrograms. Processing these heatmap images of the audio signal forms the knowledge base of the convolutional neural network. Visualizing the mel spectrograms, which relate sound frequency to the mel scale, reveals the key characteristics of the audio signal and makes it possible to compare a real voice with artificial speech. Modern speech synthesizers use complex sampling and generate synthetic speech from a recording of a person's voice and a language model. Mel spectrograms also matter for speech synthesis models, which use them to capture the timbre of a voice and encode the speaker's original speech. Convolutional neural networks automate the processing of mel spectrograms and classify voice content as original or fake. Experiments on test voice sets showed that convolutional neural networks can be trained successfully on images of MFCC spectral coefficients to classify and analyze audio content, and that this type of network can be applied in information security to identify audio deepfakes.
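
To make the mel scale mentioned in the abstract concrete, the following minimal sketch converts between hertz and mels and lists band centers spaced evenly in mels. The conversion constants (2595 and 700) are the commonly used HTK-style convention and are an assumption here, since the abstract does not say which variant the authors use. The function names and the band parameters are hypothetical and chosen only for illustration.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands: int, f_min: float, f_max: float) -> List[float]:
    """Return center frequencies (Hz) of n_bands bands evenly spaced in mels."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_bands)]


if __name__ == "__main__":
    # Illustrative values only: 8 bands between 0 Hz and 8 kHz.
    centers = mel_band_centers(n_bands=8, f_min=0.0, f_max=8000.0)
    print([round(c) for c in centers])
```

Because the bands are equally spaced in mels, their centers crowd together at low frequencies and spread out at high frequencies, which is why a mel spectrogram devotes more vertical resolution to the range where speech carries most of its information.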

Keywords: neural networks, detection of voice deepfakes, information security, speech synthesis models, deep machine learning, categorical cross-entropy, loss function, algorithms for detecting voice deepfakes, convolutional neural networks, mel-spectrograms
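
The keywords name categorical cross-entropy as the loss function. As a rough illustration of how that loss scores a two-class (original versus fake) prediction, here is a minimal standard-library sketch. The class order in `CLASSES`, the function names, and the example logits are assumptions made for this example and are not taken from the authors' model.

```python
import math
from typing import List, Sequence

# Hypothetical label order; the article does not specify one.
CLASSES = ("real", "fake")


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw network outputs into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot: Sequence[float],
                              probs: Sequence[float],
                              eps: float = 1e-12) -> float:
    """Loss for one sample: -sum(target * log(predicted probability))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


def predict_label(logits: Sequence[float]) -> str:
    """Pick the class with the highest predicted probability."""
    probs = softmax(logits)
    return CLASSES[probs.index(max(probs))]


if __name__ == "__main__":
    logits = [0.2, 1.5]    # illustrative outputs for one spectrogram
    target = [0.0, 1.0]    # one-hot target: the sample is "fake"
    probs = softmax(logits)
    print(predict_label(logits), categorical_cross_entropy(target, probs))
```

The loss is small when the network assigns high probability to the correct class and grows as that probability falls, so minimizing it during training pushes the classifier toward confident, correct decisions.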