PRECISE-CRC: PREvention through Causal Inference and Stratified Embeddings in ColoRectal Cancer

Hector Fellow Bernhard Schölkopf

Hector RCD Awardee Carolin Schneider

PRECISE-CRC develops an innovative embedding framework that converts heterogeneous lifestyle, comorbidity, and care data from biobanks into structured patient summaries using a context-aware large language model (LLM) and then transforms these summaries into high-dimensional vectors with a text-embedding model. Scalable causal methods (NOTEARS, Invariant Causal Prediction, Doubly Robust Estimator) are applied to identify true cause-effect relationships for colorectal cancer, and the embeddings are calibrated against objective biomarkers (omega-3 fatty acids, BCAA, accelerometer-derived activity, metabolomics). The resulting information is integrated into a digital prevention dashboard that provides individualized risk profiles, counterfactual scenarios, and SHAP-based explanations. The project is led by Prof. Dr. Bernhard Schölkopf and coordinated by Jun.-Prof. Dr. Carolin Schneider.

PRECISE-CRC develops an innovative, embedding-based analytical framework that converts heterogeneous lifestyle, comorbidity, and care data from large biobanks into a unified, semantically interpretable format. The core of the approach is a context-sensitive large language model (LLM) that creates patient-specific text summaries from tabular questionnaire and clinical information. These narratives are then transformed by modern text-embedding models into high-dimensional vector representations (embeddings) that compactly capture the complex relationships among dietary habits, physical activity, metabolic profiles, and disease status.
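The data flow described above (tabular record, then text summary, then vector) can be sketched in a few lines. This is only an illustration under loud assumptions: the template function `summarize` stands in for the LLM, the hashed bag-of-words function `embed` stands in for a real text-embedding model, and the field names are hypothetical. None of these are the project's actual components.

```python
import hashlib
import math


def summarize(record):
    """Template stand-in for the LLM step: turn one tabular row into a sentence."""
    parts = [f"{key.replace('_', ' ')} is {value}" for key, value in sorted(record.items())]
    return "Participant profile: " + "; ".join(parts) + "."


def embed(text, dim=64):
    """Hashed bag-of-words stand-in for a text-embedding model.

    Each token is mapped to one of `dim` buckets with a stable hash,
    and the count vector is scaled to unit length.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        token = token.strip(".;:,")
        if not token:
            continue
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical questionnaire row
record = {"alcohol_units_per_week": 12, "red_meat_servings_per_week": 4}
summary = summarize(record)
vector = embed(summary)
```

In a real pipeline both stand-ins would be replaced by model calls; the point of the sketch is only that every participant ends up as a fixed-length vector that downstream analyses can consume.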

To move from mere associations to true causality, the embeddings are linked in a second analytical layer with advanced causal machine-learning methods. Various scalable techniques, including NOTEARS, DoPFN, amortised inference for causal-structure learning, Invariant Causal Prediction (ICP), and the Doubly Robust Estimator (DRE), enable the construction of directed acyclic graphs that identify potential cause-effect links between lifestyle dimensions (e.g., alcohol consumption, red meat intake, sedentary behavior) and the occurrence of colorectal cancer (CRC). By combining graph-based discovery with robust inference, average treatment effects (ATE) can be estimated precisely while selecting features that are stable and transferable across populations.
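To make the doubly robust idea concrete, the following minimal sketch computes an augmented inverse-probability-weighted (AIPW) estimate of the ATE on simulated data. Everything here is an assumption for illustration: the toy cohort, the single binary confounder `x`, and the per-stratum nuisance estimates are not taken from the project, which works with far richer covariates.

```python
import random
from collections import defaultdict


def simulate(n, seed=0):
    """Toy cohort: binary confounder x, binary exposure t, continuous outcome y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        t = 1 if rng.random() < (0.7 if x else 0.2) else 0  # exposure depends on x
        y = 1.0 * t + 2.0 * x + rng.gauss(0.0, 1.0)          # true ATE is 1.0
        rows.append((x, t, y))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def naive_difference(rows):
    """Difference in mean outcome between exposed and unexposed (confounded)."""
    return mean(y for _, t, y in rows if t == 1) - mean(y for _, t, y in rows if t == 0)


def aipw_ate(rows):
    """AIPW estimate of the ATE with nuisance models fitted per stratum of x."""
    by_x = defaultdict(list)
    for x, t, y in rows:
        by_x[x].append((t, y))
    # Propensity score e(x) and outcome models mu1(x), mu0(x)
    e = {x: mean(t for t, _ in g) for x, g in by_x.items()}
    mu1 = {x: mean(y for t, y in g if t == 1) for x, g in by_x.items()}
    mu0 = {x: mean(y for t, y in g if t == 0) for x, g in by_x.items()}
    scores = []
    for x, t, y in rows:
        outcome_part = mu1[x] - mu0[x]
        correction = t * (y - mu1[x]) / e[x] - (1 - t) * (y - mu0[x]) / (1 - e[x])
        scores.append(outcome_part + correction)
    return mean(scores)


rows = simulate(20000)
print("naive:", round(naive_difference(rows), 2), "AIPW:", round(aipw_ate(rows), 2))
```

Because exposure is more common when `x = 1` and `x` also raises the outcome, the naive difference overstates the effect, while the adjusted estimate should land close to the simulated value of 1.0.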

A central quality feature of the project is the calibration of the algorithmically derived embeddings against objective biomarkers. Measurements such as omega-3 fatty acids, branched-chain amino acids, accelerometer-derived activity metrics, and comprehensive metabolomic profiles serve as independent validation to ensure biological plausibility and to avoid over-fitting to subjective self-reports. Correlations between individual embedding dimensions and the biomarkers are quantified with linear and non-linear regression models; any identified discrepancies are fed back into iterative optimisation of prompt designs and weighting schemes within the embedding pipeline.
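The linear part of this calibration step can be illustrated with a small sketch that ranks embedding dimensions by their Pearson correlation with one biomarker. The numbers below are made up, and the helper names `pearson` and `rank_dimensions` are hypothetical; the non-linear models mentioned above are not covered here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_dimensions(embeddings, biomarker):
    """Return (dimension index, correlation) pairs, strongest association first."""
    n_dims = len(embeddings[0])
    scored = [
        (d, pearson([row[d] for row in embeddings], biomarker))
        for d in range(n_dims)
    ]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Made-up example: four participants, two embedding dimensions, one biomarker
embeddings = [[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 2.0]]
biomarker = [2.0, 4.0, 6.0, 8.0]
ranking = rank_dimensions(embeddings, biomarker)
```

Dimensions with weak or unexpected associations would be the natural candidates for the feedback loop into prompt design and weighting.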

The complete methodology is ultimately integrated into an interactive prevention dashboard. For each user ID, an individual risk profile is displayed, showing the ten-year CRC incidence based on the embedding-derived prediction together with the causal insights. In addition, the system generates counterfactual scenarios that simulate how risk would change under targeted lifestyle interventions, such as reducing alcohol intake, increasing physical activity, or losing weight. SHapley Additive exPlanations (SHAP) are used to show transparently how much each factor contributes, allowing researchers, clinicians, and patients to understand which variables have the greatest impact on the estimated risk.
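The two dashboard ideas, what-if scenarios and Shapley-style attributions, can be sketched on a toy risk model. This is a hedged illustration only: the logistic model, its coefficients, and the feature names are hypothetical and not estimated from any data; the attribution is computed by exact enumeration of coalitions against a reference profile rather than with the SHAP library; and re-predicting with a changed input is merely a stand-in for a causally grounded counterfactual.

```python
import math
from itertools import combinations

# Hypothetical coefficients for illustration only; not estimated from any data.
COEF = {"alcohol_units_per_week": 0.03, "active_hours_per_week": -0.10, "bmi": 0.05}
INTERCEPT = -5.0


def risk(features):
    """Toy logistic risk model."""
    z = INTERCEPT + sum(COEF[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def what_if(features, **changes):
    """Risk after overriding selected features (a simple re-prediction)."""
    return risk({**features, **changes})


def shapley_values(features, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    names = list(features)
    n = len(names)

    def value(coalition):
        mixed = {k: (features[k] if k in coalition else baseline[k]) for k in names}
        return risk(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


patient = {"alcohol_units_per_week": 20.0, "active_hours_per_week": 1.0, "bmi": 31.0}
reference = {"alcohol_units_per_week": 5.0, "active_hours_per_week": 5.0, "bmi": 24.0}
contributions = shapley_values(patient, reference)
reduced_alcohol_risk = what_if(patient, alcohol_units_per_week=5.0)
```

By construction, the contributions add up to the difference between the patient's predicted risk and the reference profile's predicted risk, which is what makes such a breakdown easy to present per factor.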

The project draws on the extensive data resources of the UK Biobank (≈ 500 000 participants, including ~ 14 000 CRC cases) and, for external validation and transfer testing, the US NHANES (≈ 100 000 participants). The combination of synthetic text summaries, high-dimensional embeddings, causal graph analysis, and biomarker calibration is intended to yield a robust, cohort-independent model that is not only predictive but, crucially, explanatory.

Leadership and Coordination: The initiative is led by Prof. Dr. Bernhard Schölkopf, a world-renowned pioneer in causal machine learning, and coordinated clinically by Jun.-Prof. Dr. Carolin Schneider, whose expertise in cancer prevention, biomarker stratification, and digital health bridges the methodological advances to patient-centred application.

By linking semantic harmonisation, advanced causal inference, objective biomarker calibration, and a user-friendly visualisation platform, PRECISE-CRC aims to establish a new standard for preventive oncology. The concept can be extended to other tumour entities or chronic diseases, thereby creating a sustainable infrastructure for causal embedding models in healthcare.

Figure 1: Overview of the proposed project

   

Supervised by

Prof. Dr. Bernhard Schölkopf
Informatics, Physics & Mathematics
Hector Fellow since 2018

Prof. Dr. med. Carolin Schneider
Medicine
Hector RCD Awardee since 2024