I specialize in building multimodal AI systems that bridge vision, language, and 3D understanding. My research focuses on vision-language models, LLM fine-tuning, and generative AI, turning cutting-edge research into production-ready solutions. Previously, I completed my PhD at the University of Oxford's Visual Geometry Group (VGG), working on multimodal learning, generative models, and 3D reconstruction.
Open-source Python package for evaluating multimodal vision-language RAG systems. 11 metrics spanning retrieval quality, hallucination detection, faithfulness, and CLIP-based cross-modal alignment in one unified pipeline. Published on PyPI.
pip install mmeval-vrag
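The CLIP-based cross-modal alignment metric can be sketched as a CLIPScore-style mean cosine similarity over paired image and text embeddings. A minimal sketch, assuming the embeddings have already been produced by a CLIP encoder; the function name and toy data are illustrative, not the package's actual API:

```python
import numpy as np

def cross_modal_alignment(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """CLIPScore-style alignment: mean cosine similarity between paired
    image and text embeddings (assumed to come from a CLIP encoder)."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.mean(np.sum(img * txt, axis=1)))

# Toy vectors standing in for CLIP outputs: each "text" embedding is a
# lightly perturbed copy of its "image" embedding, i.e. a well-aligned pair.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 512))
txt = img + 0.1 * rng.normal(size=(4, 512))
score = cross_modal_alignment(img, txt)
print(round(score, 3))
```

Well-aligned pairs score close to 1, mismatched pairs near 0, which makes the metric easy to aggregate alongside retrieval and faithfulness scores in one pipeline.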
Production-ready CLIP-based multimodal model that trains only a lightweight fusion network, reaching 85-95% of SOTA performance with 10 times fewer parameters.
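The frozen-backbone setup can be illustrated with a small fusion head over precomputed features. A sketch under stated assumptions: the random arrays below stand in for frozen CLIP image/text features, and the head architecture (concatenate, then a two-layer MLP) is a plausible choice, not the project's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen CLIP features (512-d per modality); in the real
# system these would come from the pretrained, frozen CLIP towers.
dim, hidden, n_classes = 512, 256, 10
img_feat = rng.normal(size=(8, dim))
txt_feat = rng.normal(size=(8, dim))

# Lightweight fusion head: the ONLY trainable parameters.
W1 = rng.normal(scale=0.02, size=(2 * dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.02, size=(hidden, n_classes))
b2 = np.zeros(n_classes)

def fuse(img, txt):
    # Concatenate modalities, apply ReLU MLP, return class logits.
    h = np.maximum(np.concatenate([img, txt], axis=1) @ W1 + b1, 0.0)
    return h @ W2 + b2

logits = fuse(img_feat, txt_feat)
trainable = W1.size + b1.size + W2.size + b2.size
print(logits.shape, trainable)  # ~265k trainable params vs ~150M for a full CLIP model
```

Freezing the backbones means only a few hundred thousand parameters are optimized, which is where the order-of-magnitude parameter saving comes from.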
Production RAG system for decision support integrating multimodal inputs.
Conditional diffusion models for generating realistic synthetic financial time series.
Safe reinforcement learning with normalizing flows for uncertainty quantification in time series.
Dataset of one billion labeled masks for generalizable 3D segmentation across diverse domains, enabling large-scale training of AI models.
Automated framework using Vision Transformers to estimate 3D shape from single 2D images, achieving state-of-the-art reconstruction accuracy.
Cross-sectional diffusion model for generating complete 3D volumes from single slices, setting new benchmarks in volumetric synthesis.