
Advancing rare disease classification: exploring representation learning in low-data and heavy tail settings

Laure Ciernik — Hector Fellow Klaus-Robert Müller

This project seeks to advance rare disease classification using deep neural networks by addressing key challenges such as limited data and high heterogeneity. We will assess existing models and their representations, investigating how technical variations in medical data affect their characteristics. Additionally, we will apply representation learning to large, unlabeled datasets, with an emphasis on DNA methylation data, to improve knowledge transfer from common to rare diseases. Finally, we will address label imbalance in model training to obtain more accurate and well-calibrated disease predictions.

Accurate disease classification is crucial for timely diagnosis and tailored treatments. However, it becomes challenging for rare diseases because of limited data, high heterogeneity, and complexity. This project focuses on rare disease classification with deep neural networks. We will use models that project data into representation spaces that capture the semantic categories of diseases. Three aspects will be explored.

First, we will assess existing models and their learned representations, evaluating their characteristics and similarities and identifying desirable traits. We also aim to investigate how technical variation in medical data, such as differences in data sources and patient characteristics (known as batch effects), affects model representations, and to explore methods for mitigating it.
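As a rough illustration of what comparing learned representations can look like, the sketch below computes linear centered kernel alignment (CKA) between two sets of embeddings of the same samples. CKA is only one possible similarity measure and is an assumption here, not a method named by the project; the function names and the toy data are hypothetical.

```python
import math


def center_columns(X):
    """Subtract the per-feature mean from every row of X (a list of rows)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def gram(X):
    """Return the n x n matrix of dot products between rows of X."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]


def frobenius_inner(A, B):
    """Sum of element-wise products of two equally shaped matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def linear_cka(X, Y):
    """Linear CKA between two representations of the same n samples.

    X and Y may have different feature dimensions. The result is undefined
    (division by zero) if either representation is constant across samples.
    """
    K = gram(center_columns(X))
    L = gram(center_columns(Y))
    return frobenius_inner(K, L) / math.sqrt(
        frobenius_inner(K, K) * frobenius_inner(L, L)
    )


# Toy embeddings of four samples (hypothetical values).
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# The same embeddings rotated by 90 degrees and scaled by 3.
X_rotated = [[-3.0 * y, 3.0 * x] for x, y in X]
# An unrelated one-dimensional representation.
Y = [[1.0], [0.0], [0.0], [1.0]]

print(linear_cka(X, X_rotated))
print(linear_cka(X, Y))
```

Linear CKA is invariant to rotations and uniform scaling of the feature space, so the first comparison should give a value of essentially 1, while the second should be much lower.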

Second, we will use representation learning on large, unlabeled datasets to capture biological patterns across various conditions. This approach has been shown to facilitate knowledge transfer from common to rare diseases. Our primary focus will be DNA methylation data, for which, to the best of our knowledge, no models exist comparable to those available for other data types such as histopathology and single-cell RNA sequencing.

Finally, we will address label imbalance, which remains an issue even with good representations. We will therefore investigate training techniques for the downstream disease classification task that can handle this imbalance and produce well-calibrated predictions.
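To make the imbalance problem concrete, the sketch below shows one common baseline, inverse-frequency class weighting of the cross-entropy loss. It is an assumption chosen for illustration, not necessarily a technique the project will adopt; the function names and the toy labels are hypothetical, and calibration is not covered.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the total weight.

    probs[i][c] is the predicted probability of class c for sample i.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Toy setting: 8 samples of a common disease (class 0), 2 of a rare one (class 1).
labels = [0] * 8 + [1] * 2
# A model that always favors the common class.
probs = [[0.9, 0.1] for _ in labels]

weights = inverse_frequency_weights(labels)
uniform = {0: 1.0, 1: 1.0}

print(weights)
print(weighted_cross_entropy(probs, labels, uniform))
print(weighted_cross_entropy(probs, labels, weights))
```

With these weights both classes contribute equally to the loss, so a model that ignores the rare class is penalized more heavily than under the unweighted loss.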

Representation learning in low-data and heavy tail settings.

Laure Ciernik

Technische Universität Berlin

Supervised by Prof. Dr. Klaus-Robert Müller

Informatics, Mathematics & Physics

Hector Fellow since 2023