Publications

My most important publications, covering security, health, and translation applied to speech and handwriting, are listed below:

Clustering Unsupervised Representations as Defense against Poisoning Attacks on Speech Commands Classification System

Published in Workshop on Automatic Speech Recognition and Understanding (ASRU 2023), 2023

Poisoning attacks entail attackers intentionally tampering with training data. In this paper, we consider a dirty-label poisoning attack scenario on a speech commands classification system. The threat model assumes that certain utterances from one of the classes (source class) are poisoned by superimposing a trigger on them, and that their labels are changed to another class selected by the attacker (target class). We propose a filtering defense against such an attack. First, we use DIstillation with NO labels (DINO) to learn unsupervised representations for all the training examples. Next, we use K-means and LDA to cluster these representations. Finally, we keep the utterances with the most repeated label in their cluster for training and discard the rest. For a 10% poisoned source class, we demonstrate a drop in attack success rate from 99.75% to 0.25%. We test our defense against a variety of threat models, including different target and source classes, as well as trigger variations.
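The final filtering step described above can be illustrated with a minimal sketch. It assumes the cluster assignments have already been computed (for example by K-means on the learned representations); the function name `filter_by_cluster_majority` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def filter_by_cluster_majority(cluster_ids, labels):
    """Return sorted indices of examples whose label is the most common one in their cluster.

    cluster_ids and labels are parallel lists. Ties are broken in favour of
    the label seen first in the cluster (an arbitrary choice for this sketch).
    """
    members_by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members_by_cluster[cid].append(idx)

    kept = []
    for members in members_by_cluster.values():
        majority_label, _ = Counter(labels[i] for i in members).most_common(1)[0]
        kept.extend(i for i in members if labels[i] == majority_label)
    return sorted(kept)


# Toy data (hypothetical): example 3 sits in the "stop" cluster but carries the
# attacker's target label "go", as a poisoned utterance would.
cluster_ids = [0, 0, 0, 0, 1, 1, 1]
labels = ["stop", "stop", "stop", "go", "go", "go", "go"]

kept = filter_by_cluster_majority(cluster_ids, labels)
print(kept)
```

In this toy case the relabelled example is the only one discarded, because its label disagrees with the majority of its cluster.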

Download here

JHU IWSLT 2023 Dialect Speech Translation System Description

Published in Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), 2023

This paper presents JHU’s submissions to the IWSLT 2023 dialectal and low-resource track of Tunisian Arabic to English speech translation. The Tunisian dialect lacks formal orthography and abundant training data, making it challenging to develop effective speech translation (ST) systems. To address these challenges, we explore the integration of large pre-trained machine translation (MT) models, such as mBART and NLLB-200, in both end-to-end (E2E) and cascaded ST systems. We also improve the performance of automatic speech recognition (ASR) through the use of pseudo-labeling data augmentation and channel matching on telephone data. Finally, we combine our E2E and cascaded ST systems with Minimum Bayes-Risk decoding. Our combined system achieves BLEU scores of 21.6 and 19.1 on test2 and test3, respectively.
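As a rough illustration of how Minimum Bayes-Risk decoding can combine hypotheses from several systems, the sketch below picks the candidate that agrees most, on average, with the other candidates. The utility function (a unigram F1), the uniform weighting of candidates, and the example sentences are all assumptions made for this sketch; they are not the settings used in the paper.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Unigram F1 overlap between two whitespace-tokenized strings (stand-in utility)."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses, utility=unigram_f1):
    """Return the hypothesis with the highest mean utility against all other hypotheses."""
    if len(hypotheses) < 2:
        raise ValueError("MBR selection needs at least two hypotheses")

    def expected_utility(i):
        others = [h for j, h in enumerate(hypotheses) if j != i]
        return sum(utility(hypotheses[i], o) for o in others) / len(others)

    best_index = max(range(len(hypotheses)), key=expected_utility)
    return hypotheses[best_index]


# Hypothetical outputs from different systems for the same utterance.
candidates = [
    "i will go to the market tomorrow",
    "i go to the market tomorrow",
    "he went to a shop yesterday",
]
best = mbr_select(candidates)
print(best)
```

The selected sentence is the one closest to the consensus of the pool, which is why an outlier hypothesis is unlikely to win.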

Download here

On the invertibility of a voice privacy system using embedding alignment

Published in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021

This paper explores various attack scenarios on a voice anonymization system using embedding alignment techniques. We use Wasserstein-Procrustes (an algorithm initially designed for unsupervised translation) or Procrustes analysis to match two sets of x-vectors, before and after voice anonymization, to mimic this transformation as a rotation function. We compute the optimal rotation and compare the results of this approximation to the official Voice Privacy Challenge results. We show that a complex system like the baseline of the Voice Privacy Challenge can be approximated by a rotation, estimated using a limited set of x-vectors. This paper studies the space of solutions for voice anonymization within the specific scope of rotations. Because rotations are reversible, the proposed method can recover up to 62% of the speaker identities from anonymized embeddings.
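The idea of approximating a transformation by a rotation and then inverting it can be shown on a toy problem. The sketch below uses two-dimensional points, where the optimal Procrustes rotation has a closed form; real speaker embeddings are high-dimensional and would need an SVD-based solution. The function names and the data are hypothetical.

```python
import math


def fit_rotation_2d(before, after):
    """Angle (radians) of the rotation R minimizing sum ||R a_i - b_i||^2 over paired 2D points."""
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(before, after))
    cross = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(before, after))
    return math.atan2(cross, dot)


def rotate(points, theta):
    """Rotate each 2D point counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Toy stand-ins for embeddings before and after a transformation that is
# (by construction here) an exact 40-degree rotation.
original = [(1.0, 0.0), (0.0, 2.0), (-1.5, 0.5), (0.3, -0.7)]
transformed = rotate(original, math.radians(40))

# Estimate the rotation from the paired points, then undo it.
theta = fit_rotation_2d(original, transformed)
recovered = rotate(transformed, -theta)
max_error = max(math.dist(p, q) for p, q in zip(original, recovered))
print(math.degrees(theta), max_error)
```

Because the toy transformation is an exact rotation, the inverse recovers the original points up to floating-point error; with a real anonymization system the rotation is only an approximation, so recovery is partial.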

Download here

Spoofing speaker verification with voice style transfer and reconstruction loss

Published in 2021 IEEE International Workshop on Information Forensics and Security (WIFS), 2021

In this paper, we investigate a template reconstruction attack against a speaker verification system. A stolen speaker embedding is processed with a zero-shot voice-style transfer system to reconstruct a Mel-spectrogram containing as much speaker information as possible. We assume the attacker has black-box access to a state-of-the-art automatic speaker verification system. We modify the AutoVC voice-style transfer system to spoof the automatic speaker verification system. We find that integrating a new loss targeting embedding reconstruction and optimizing training hyper-parameters significantly improves spoofing. Results obtained for speaker verification are similar to other biometrics, such as handwritten digits or face verification. We show on standard corpora (VoxCeleb and VCTK) that the reconstructed Mel-spectrograms contain enough speaker characteristics to spoof the original authentication system.

Download here

Handwritten digits reconstruction from unlabelled embeddings

Published in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021

In this paper, we investigate a template reconstruction attack on touchscreen biometrics, based on handwritten digits writer verification. In the event of a template database theft, we show that reconstructing the original drawn digit from the embeddings is possible without access to the original embedding encoder. Using an external labelled dataset, an attack encoder is trained along with a Mixture Density Recurrent Neural Network decoder. Thanks to an alignment flow, initialized with Linear Discriminant Analysis and Procrustes, the transfer function between the output spaces of the original and the attack encoders is estimated. Successively applying the transfer function and the decoder to the stolen embeddings makes it possible to reconstruct the original drawings, which can be used to spoof the behavioural biometrics system.

Download here

Unsupervised labelling of stolen handwritten digit embeddings with density matching

Published in International Conference on Applied Cryptography and Network Security, 2020

Biometric authentication is now widely deployed, and with that ubiquity comes the need to protect private data. Recent studies have shown touchscreen handwritten digits to be a reliable biometric. We define a threat model based on this biometric: in the event of theft of unlabelled embeddings of handwritten digits, we propose a labelling method inspired by recent unsupervised translation algorithms. Given a set of unlabelled embeddings known to have been produced by a Long Short-Term Memory Recurrent Neural Network (LSTM RNN), we demonstrate that inferring their labels is possible. The proposed approach involves label-wise clustering of the embeddings and label identification of each group by matching their distribution to the label-relative classes of a hand-crafted labelled comparison set of embeddings.

Download here