A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Posts
Everything is a video: Unifying modalities through next-frame prediction
The current ‘AI boom’ is driven by a key development - reformulating all kinds of NLP tasks as a single ‘supertask’ - chat dialogues.
NLP for Care Transitions
Discharge summaries are the handover from hospital to GP, but follow-up requests are often missed. Analysing thousands of summaries, we found that wording matters. Can a simple AI model spot these patterns and flag requests at risk of being overlooked, opening the door to more reliable handovers?
pyWebcamSteg: An anonymising proxy through your webcam
ExploitDB’s Google Hacking Database (GHDB) is a great resource for finding sneaky Google search queries that surface things on the web which probably shouldn’t be public.
portfolio
Portfolio item number 1
Short description of portfolio item number 1
Portfolio item number 2
Short description of portfolio item number 2 
publications
On the Development of a Large Scale Corpus for Native Language Identification
17th International Workshop on Treebanks and Linguistic Theories (TLT), Dec 2018
Building a dataset for Native Language Identification from language learning forums.
Ask me in your own words: paraphrasing for multitask question answering
PeerJ Computer Science, Oct 2021
Using paraphrasing to improve performance across multitask question answering benchmarks.
MuLD: The Multitask Long Document Benchmark
Proceedings of the Language Resources and Evaluation Conference (LREC 2022), Jun 2022
A new long-document benchmark consisting only of documents over 10,000 tokens.
Can Text Encoders be Deceived by Length Attack?
The 11th International Conference on Learning Representations (ICLR 2023), May 2023
We propose an editing method that effectively improves the robustness of models against length attacks; the improvement can be attributed to reduced length information in the embeddings and more robust intra-document token interaction.
Towards more human-like language models based on contextualizer pretraining strategy
EMNLP/CoNLL 2023 BabyLM Challenge, May 2023
A contextualizer pretraining strategy to produce more human-like language models; winner of the EMNLP/CoNLL BabyLM Challenge Loose Track.
Length is a Curse and a Blessing for Document-level Semantics
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Dec 2023
We show that contrastive learning models are sensitive to text length in ways that distort semantic representations, and propose a length-agnostic framework that improves robustness and retrieval performance.
Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers
IEEE Transactions on Neural Networks and Learning Systems, Jul 2025
Pixel-space spatiotemporal transformers for predicting the future states of dynamic physical simulations.
Everything is a video: Unifying modalities through next-frame prediction
International Conference on Computer Vision (ICCV 2025), Oct 2025
A unified framework that treats all modalities as video and learns through next-frame prediction.
Multimodal Models for Skin Cancer Classification using Clinical Free Text and Dermatoscopic Images
Nature Communications Medicine, Mar 2026
Building a model which considers clinical free text alongside dermatology images.
Automating the quality monitoring of a hospital discharge summary improvement project utilising large language models
npj Digital Medicine, Apr 2026
Using large language models to automatically monitor and improve the quality of hospital discharge summaries.
Generalizable multilingual medical text de-identification using generative instruction tuning
Nature Communications Medicine, Apr 2026
Using synthetic data to build robust anonymization models.
Bridging Survival Analysis and Machine Learning to Improve Healthy Life Expectancy Estimation using PHR Records
npj Digital Medicine, May 2026
Combining survival analysis with machine learning to estimate healthy life expectancy from personal health records.
talks
NLP for Care Transitions
Presenting my work on using NLP for exploring hospital discharge summaries.
