PhD Thesis

I completed my PhD in Computer Engineering at Boğaziçi University in 2023.

Title: Source Separation via Weakly Supervised and Unsupervised Deep Learning
Advisors: Ali Taylan Cemgil, Serap Kırbız, and Cem Say

Work from this thesis appears in the following publications:

E. Karamatlı, A. T. Cemgil and S. Kırbız, “Audio Source Separation Using Variational Autoencoders and Weak Class Supervision,” IEEE Signal Processing Letters, 2019 [arXiv] [IEEE Xplore] [Code & Audio Samples]
E. Karamatlı and S. Kırbız, “MixCycle: Unsupervised Speech Separation via Cyclic Mixture Permutation Invariant Training,” IEEE Signal Processing Letters, 2022 [arXiv] [IEEE Xplore] [Code & Audio Samples]
E. Karamatlı, A. T. Cemgil and S. Kırbız, “Source Separation and Classification Using Generative Adversarial Networks and Weak Class Supervision,” Digital Signal Processing, 2024 [Elsevier]

Abstract

Source separation has been a key research area for several decades, and the emergence of deep learning approaches has revolutionized the field. Although supervised methods have been the pillars of this revolution, training such models often requires synthetic mixture datasets that may not adequately represent real-world mixture signals. In this thesis, we focus on single-channel source separation methods that are trained without access to the underlying isolated source signals. This enables source separation models to be trained solely on real-world mixture recordings for which the corresponding source signals are not available, and therefore on large amounts of unlabeled or weakly-labeled data without additional labeling effort. We approach this problem in several ways. First, we develop a decomposition-based weakly-supervised model that utilizes the class labels of the sources present in the mixtures. We apply this weak class supervision approach to superimposed handwritten digit images using both non-negative matrix factorization (NMF) and generative adversarial networks (GANs). Second, we introduce another decomposition-based model that employs variational autoencoders (VAEs) to apply our weak class supervision approach to audio signals. Third, we introduce two purely unsupervised methods, which are trained exclusively on the mixture signals in a self-supervised fashion. The results of our experiments demonstrate that the proposed weakly-supervised and unsupervised methods are viable and mostly on par with the fully supervised baselines. We conclude that, in compatible real-world applications, supervised training can be replaced with weakly-supervised and unsupervised methods to obtain better results.
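To give a rough feel for what decomposition with weak class supervision can look like, here is a minimal NumPy sketch. It is not the model from the thesis: it assumes a plain Euclidean NMF with standard multiplicative updates, one block of `k` basis vectors per class, and a binary label matrix that says which classes are present in each mixture. The function names `weak_class_nmf` and `separate_by_class` and all parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np


def weak_class_nmf(X, labels, k=4, n_iter=200, seed=0, eps=1e-9):
    """Illustrative NMF with class-presence masking (assumed setup, not the thesis model).

    X      : (n_features, n_mixtures) non-negative mixtures, one per column.
    labels : (n_classes, n_mixtures) binary matrix, 1 if the class is in the mixture.
    k      : number of basis vectors reserved for each class.
    """
    rng = np.random.default_rng(seed)
    n_features, n_mixtures = X.shape
    n_classes = labels.shape[0]

    # One block of k basis vectors per class.
    W = rng.random((n_features, n_classes * k)) + eps

    # Repeat each class row k times so the mask lines up with the rows of H.
    mask = np.repeat(np.asarray(labels, dtype=float), k, axis=0)

    # Activations of classes that are absent from a mixture start at zero.
    H = rng.random((n_classes * k, n_mixtures)) * mask

    for _ in range(n_iter):
        # Standard multiplicative updates for the squared Euclidean cost.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        H *= mask  # keep absent classes switched off
        W *= (X @ H.T) / (W @ H @ H.T + eps)

    return W, H


def separate_by_class(W, H, k):
    """Return one source estimate per class, each shaped like the mixtures."""
    n_classes = W.shape[1] // k
    return [
        W[:, c * k:(c + 1) * k] @ H[c * k:(c + 1) * k, :]
        for c in range(n_classes)
    ]
```

The only supervision in this sketch is the mask: isolated sources are never used, yet each block of `W` can only explain mixtures labeled as containing its class, which is what pushes the blocks toward class-specific components. Calling `separate_by_class(W, H, k)` then returns per-class estimates whose sum equals the reconstruction `W @ H`.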