Generalizable multilingual medical text de-identification using generative instruction tuning
Published in Nature Communications Medicine, 2026
Medical research depends on access to high-quality data that protects patient privacy. Free text in health records contains valuable clinical detail, yet it often includes sensitive personal information that must be removed before use. Current approaches rely on manually created training data and focus mainly on narrow domains, which makes them difficult to scale to new medical fields and languages. This study addresses these limitations by developing a framework that supports privacy-preserving use of medical text across diverse settings.
The study introduces an annotation-free framework for training and adapting LLM-based anonymization models across diverse medical domains. The reproducible framework centres on a generative medical anonymization model, built by instruction-tuning LLMs on synthetic data. Performance is evaluated on both synthetic test sets and on patient requests from a digital triage service, assessing accuracy, recall, precision, and the ability to preserve the meaning of non-sensitive text.
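To illustrate the kind of training signal such a framework produces, the sketch below constructs one synthetic instruction-tuning example: a clinical template is filled with fictitious identifiers to form the input, and the target is the same text with those spans replaced by category tags. The template, placeholder names, and tag format are illustrative assumptions, not the paper's actual schema.

```python
import random

# Hypothetical sketch of one synthetic instruction-tuning record for a
# generative de-identification model. All names, dates, and tags below
# are invented for illustration.

FAKE_NAMES = ["Anna Berg", "Jonas Lind"]
FAKE_DATES = ["2024-03-15", "2023-11-02"]

TEMPLATE = "Patient {NAME} presented on {DATE} with persistent cough."

def make_example(rng: random.Random) -> dict:
    """Fill the template with fake identifiers (input) and with
    category tags (output), so the model learns to rewrite sensitive
    spans while preserving all non-sensitive text verbatim."""
    source = TEMPLATE.format(
        NAME=rng.choice(FAKE_NAMES),
        DATE=rng.choice(FAKE_DATES),
    )
    target = TEMPLATE.format(NAME="[NAME]", DATE="[DATE]")
    return {
        "instruction": "Anonymize all personal identifiers in the text.",
        "input": source,
        "output": target,
    }

example = make_example(random.Random(0))
```

Because the fake identifiers are generated rather than drawn from patient records, an arbitrary number of such pairs can be produced without any manual annotation, which is what makes the approach annotation-free.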
Here we show that generative models trained with the synthetic framework reach performance that exceeds strong baseline systems across several medical domains. The models preserve non-sensitive text with high fidelity and anonymize sensitive information with high accuracy. They perform well even when trained on small datasets, generalize to unseen clinical fields, and support anonymization in multiple languages without requiring additional training data in those languages.
The study presents a reproducible, annotation-free approach that enables the development of effective anonymization models for medical text. The framework reduces reliance on real patient data, lowers the cost of adaptation to new settings, and supports wider use of unstructured clinical information for research and service improvement.
