Posts by Collection

publications

On the Development of a Large Scale Corpus for Native Language Identification

17th International Workshop on Treebanks and Linguistic Theories (TLT), Dec 2018

Building a dataset for Native Language Identification from language learning forums.

Ask me in your own words: paraphrasing for multitask question answering

PeerJ Computer Science, Oct 2021

Using paraphrasing to improve performance across multitask question answering benchmarks.

MuLD: The Multitask Long Document Benchmark

Proceedings of the Language Resources and Evaluation Conference (LREC 2022), Jun 2022

A new long-document benchmark consisting only of documents over 10,000 tokens.

Can Text Encoders be Deceived by Length Attack?

The 11th International Conference on Learning Representations (ICLR 2023), May 2023

Proposes an editing method that improves the robustness of models against length attacks; the improvement can be attributed to reduced length information in the embeddings and more robust intra-document token interaction.

Towards more human-like language models based on contextualizer pretraining strategy

EMNLP/CoNLL 2023 BabyLM Challenge, May 2023

A contextualizer pretraining strategy to produce more human-like language models; winner of the EMNLP/CoNLL BabyLM Challenge Loose Track.

Length is a Curse and a Blessing for Document-level Semantics

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Dec 2023

We show that contrastive learning models are sensitive to text length in ways that distort semantic representations, and propose a length-agnostic framework that improves robustness and retrieval performance.

Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers

IEEE Transactions on Neural Networks and Learning Systems, Jul 2025

Pixel-space spatiotemporal transformers for predicting the future states of dynamic physical simulations.

Everything is a video: Unifying modalities through next-frame prediction

International Conference on Computer Vision (ICCV 2025), Oct 2025

A unified framework that treats all modalities as video and learns through next-frame prediction.

Multimodal Models for Skin Cancer Classification using Clinical Free Text and Dermatoscopic Images

Nature Communications Medicine, Mar 2026

Building a model that considers clinical free text alongside dermatology images.

Automating the quality monitoring of a hospital discharge summary improvement project utilising large language models

npj Digital Medicine, Apr 2026

Using large language models to automatically monitor and improve the quality of hospital discharge summaries.

Generalizable multilingual medical text de-identification using generative instruction tuning

Nature Communications Medicine, Apr 2026

Using synthetic data to build robust anonymization models.

Bridging Survival Analysis and Machine Learning to Improve Healthy Life Expectancy Estimation using PHR Records

npj Digital Medicine, May 2026

Combining survival analysis with machine learning to estimate healthy life expectancy from personal health records.

talks

NLP for Care Transitions

Presenting my work on using NLP to explore hospital discharge summaries.