Rencontres Statistiques du CEREMADE (Perrine Lacroix, Monday 25 March 2024)

11 March 2024

The next session of the Rencontres Statistiques du CEREMADE will take place on Monday 25 March 2024 at 2 pm in room A707. We will have the pleasure of hearing Perrine Lacroix (ENS de Lyon), who will present her work on "Non-asymptotic control of a kernel 2-sample test".

Title
Non-asymptotic control of a kernel 2-sample test

Abstract 
We are interested in statistical tests of the hypothesis H₀: {P = Q} against the alternative H₁: {P ≠ Q}. Our data are multivariate, high-dimensional and exhibit strong dependencies between variables. We propose a test comparing two distributions based on kernel methods: the data are first transformed via a well-chosen feature map and then live in a reproducing kernel Hilbert space (RKHS). Our kernel test statistic is the analogue of Hotelling's $T^2$ comparison test for finite-dimensional multivariate data, and is equal to the difference of the kernel mean embeddings (the quantity underlying the maximum mean discrepancy, MMD) renormalized by a well-chosen covariance operator.
Classically, these non-parametric tests are calibrated either asymptotically or via test aggregation techniques. Here, we propose to calibrate the test at a given fixed sample size by obtaining non-asymptotic bounds on our test statistic. For this, a regularization is required, because the covariance operator is approximated by its empirical estimator. Unlike the approaches of Harchaoui et al. (2007) or Hagrass et al. (2023), which use $L_2$ regularizations, we propose spectral truncation. This method fixes the number $T$ of eigenfunctions used to reconstruct the covariance operator, a number that is unknown in practice, and provides the additional advantage of data visualization.
Currently, at a fixed $T$, the test statistic, called the truncated kernel Fisher Discriminant Ratio (KFDA\_T), provides a test whose asymptotic calibration is known (Ozier-Lafontaine et al., 2023). In this talk, I will present how to derive a theoretical, non-asymptotic bound on the p-value of the test associated with the KFDA\_T. This bound is a first step towards defining a good calibration of the hyperparameter $T$.
In applications, this statistical question is essential in the field of genomics, where the two groups are composed of single-cell RNA-seq data. The goal is to detect distinct or similar biological behavior between the groups.

Joint work with Bertrand Michel (Université de Nantes), Franck Picard (ENS de Lyon) and Vincent Rivoirard (Paris-Dauphine).
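
As a rough illustration of the spectral truncation described in the abstract, the following Python sketch computes a truncated kernel Fisher discriminant statistic from a Gram matrix. It is not the speaker's implementation: the function names `gaussian_gram` and `truncated_kfda`, the Gaussian kernel, the bandwidth, and the normalization by $n_1 n_2 / n$ with a within-group covariance divided by $n$ are all assumptions chosen for the example. The statistic is the squared norm of the mean embedding difference after projection onto the top $T$ eigenfunctions of the empirical within-group covariance operator, each direction being rescaled by its eigenvalue.

```python
import numpy as np


def gaussian_gram(Z, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel (the kernel choice is an assumption)."""
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth**2))


def truncated_kfda(K, n1, n2, T):
    """Truncated kernel Fisher discriminant statistic (illustrative sketch).

    K is the (n1+n2) x (n1+n2) Gram matrix of the pooled sample, with the
    first n1 rows belonging to group 1. The normalization used here is one
    possible convention, not necessarily the one used in the talk.
    """
    n = n1 + n2
    # Within-group centering: subtract each group's own mean embedding.
    P = np.eye(n)
    P[:n1, :n1] -= 1.0 / n1
    P[n1:, n1:] -= 1.0 / n2
    # Weights such that sum_i omega_i * phi(z_i) = mu_2 - mu_1.
    omega = np.concatenate([np.full(n1, -1.0 / n1), np.full(n2, 1.0 / n2)])

    # The nonzero eigenvalues of P K P / n are those of the empirical
    # within-group covariance operator.
    lam, U = np.linalg.eigh(P @ K @ P / n)
    lam, U = lam[::-1][:T], U[:, ::-1][:, :T]  # keep the T largest
    if lam[-1] <= 1e-12 * lam[0]:
        raise ValueError("T exceeds the numerical rank of the covariance")

    # <f_t, mu_2 - mu_1> = u_t^T P K omega / sqrt(n * lam_t), where f_t is
    # the unit-norm eigenfunction associated with lam_t.
    proj = U.T @ (P @ (K @ omega))
    return (n1 * n2 / n) * np.sum(proj**2 / (n * lam**2))


# Hypothetical usage on simulated data with a mean shift between groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
Y = rng.normal(size=(35, 5)) + 0.5
K = gaussian_gram(np.vstack([X, Y]), bandwidth=2.0)
for T in (1, 2, 4):
    print(T, truncated_kfda(K, len(X), len(Y), T))
```

With a linear kernel and $T$ equal to the data dimension, this quantity reduces to a Hotelling-type statistic built on the pooled within-group covariance, which gives a simple sanity check. The sketch only computes the statistic; choosing $T$ and calibrating the test at a fixed sample size are the questions addressed in the talk.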