Multimodal AI Researcher at PolyU URIS · Full-Stack Engineer · 2 papers under review at ACM MM 2026 & KDD 2026
CS undergrad at The Hong Kong Polytechnic University (GPA 3.85), selected for the Undergraduate Research & Innovation Scheme (URIS). My work investigates layered VLM architectures: whether keeping detection and tracking out of the reasoning loop reduces latency and hallucination rate while improving interpretability in multi-turn dialogue.
Alongside the URIS project, I've contributed to two conference submissions: a traffic-scene VQA benchmark (ACM MM 2026) and a cross-discipline survey generation benchmark (KDD 2026). I also build full-stack AI applications that run in production: deployed, used, and iterated on.
Layered VLM system — YOLOv8 + ByteTrack + Qwen2.5-VL on-demand. Investigating latency vs. interpretability trade-offs over multi-turn dialogue.
CCD-VQA (traffic understanding, 1,194 QA pairs) and SurveyLens (ASG benchmark, 10 disciplines). Both under review.
A VQA benchmark targeting traffic accident understanding — an area where current VLMs show a systematic reasoning failure. 199 accident videos filtered from 1,500 via YOLOv8 kinematic analysis; 1,194 QA pairs across six dimensions: Weather/Light, Traffic Environment, Road Configuration, Accident Type, Accident Cause, and Accident Prevention. Sentence-BERT controls distractor quality; a Benchmark Suitability Score (BSS) grounded in Item Response Theory suppresses random-guessing shortcuts.
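For illustration, here is a minimal sketch of the kind of Sentence-BERT distractor filter the benchmark relies on; the model name, similarity band, and example strings are illustrative assumptions, not CCD-VQA's exact settings.

```python
# Minimal sketch of a Sentence-BERT distractor-quality filter.
# Model name and similarity band are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def plausible_distractors(answer: str, candidates: list[str],
                          lo: float = 0.4, hi: float = 0.85) -> list[str]:
    """Keep distractors semantically close to the answer (plausible)
    but not so close that they are paraphrases of it."""
    ans_emb = model.encode(answer, convert_to_tensor=True)
    cand_embs = model.encode(candidates, convert_to_tensor=True)
    sims = util.cos_sim(ans_emb, cand_embs)[0]
    return [c for c, s in zip(candidates, sims) if lo <= float(s) <= hi]

# Example: filter distractors for an accident-cause question.
keep = plausible_distractors(
    "The driver ran a red light",
    ["The driver ignored the traffic signal",   # too close: a paraphrase
     "Heavy rain reduced visibility",           # plausible alternative
     "A cat sat on the dashboard"],             # too unrelated
)
```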
Existing ASG benchmarks are CS-biased and use generic metrics. SurveyLens-1k covers 10 disciplines (Biology, Business, CS, Education, Engineering, Environmental Science, Medicine, Physics, Psychology, Sociology) with 100 human-written surveys each. Evaluation uses discipline-aware rubric scoring (LLM-as-judge with Bradley-Terry preference-aligned weights) and canonical alignment metrics (RAMS + TAMS). Key finding: Deep Research agents produce richer narratives but lose structural precision; vanilla LLMs outperform specialised ASG systems in the humanities.
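A minimal sketch of how Bradley-Terry weights can be fit from pairwise judge preferences via the standard MM update; the rubric dimensions and comparison counts below are toy assumptions, not SurveyLens data.

```python
# Fit Bradley-Terry strengths from pairwise preferences via the
# classic MM update. Dimension names and win counts are hypothetical.
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of times item i was preferred over item j."""
    n = wins.shape[0]
    comps = wins + wins.T            # total comparisons per pair
    p = np.ones(n)                   # initial strengths
    for _ in range(iters):
        total_wins = wins.sum(axis=1)
        denom = np.array([
            sum(comps[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        p = total_wins / denom       # MM update (Hunter 2004)
        p /= p.sum()                 # normalise to rubric weights
    return p

dims = ["coverage", "structure", "citation", "synthesis"]
wins = np.array([[0, 6, 4, 7],
                 [2, 0, 3, 5],
                 [4, 5, 0, 6],
                 [1, 3, 2, 0]], dtype=float)
print(dict(zip(dims, bradley_terry(wins).round(3))))
```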
Investigates whether a layered detector-tracker-VLM pipeline (YOLOv8 + pluggable tracker + Qwen2.5-VL on-demand) achieves lower latency and higher interpretability than end-to-end VLM querying. Object Registry maintains temporal event logs; Reference Resolver handles spatial and ordinal disambiguation. Evaluation Lab produces structured JSON metrics: json_valid_rate, clarification_rate, reference_resolution_rate, cache_hit_rate.
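A minimal sketch of the Object Registry idea, assuming a simple per-track event log; field names, the event schema, and the summary format are illustrative, not the project's actual interfaces.

```python
# Sketch of an Object Registry that keeps detector/tracker output out
# of the VLM loop; the VLM is queried on demand with this structured
# state as grounding. Field names are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class TrackedObject:
    track_id: int
    label: str                          # e.g. "car", "pedestrian"
    events: list[dict] = field(default_factory=list)

class ObjectRegistry:
    """Temporal event log keyed by track ID."""
    def __init__(self) -> None:
        self.objects: dict[int, TrackedObject] = {}

    def update(self, frame_idx: int, detections: list[dict]) -> None:
        for det in detections:          # det: {"track_id", "label", "bbox"}
            obj = self.objects.setdefault(
                det["track_id"], TrackedObject(det["track_id"], det["label"]))
            obj.events.append({"frame": frame_idx, "bbox": det["bbox"]})

    def summary(self) -> str:
        """Compact textual state handed to the VLM prompt on demand."""
        return "; ".join(
            f"{o.label}#{o.track_id}: seen {len(o.events)} frames"
            for o in self.objects.values())
```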
Languages, frameworks, and infrastructure used across research and production.
Production deployments, research tooling, and AI systems.
AI slide lecture assistant — upload any PPT, get page-by-page explanations in a split-screen view. Two Qwen models run in parallel: vision for diagram parsing, text for cross-slide narrative. Prior-slide summaries passed forward as context to maintain coherence.
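A minimal sketch of the forward-context chain, with placeholder functions standing in for the two Qwen calls (which the real app runs in parallel per slide).

```python
# Sketch of the prior-slide context chain. vision_explain and
# text_summarise are hypothetical placeholders for the two Qwen calls.
def vision_explain(slide_image: bytes, context: str) -> str:
    # Placeholder for the vision-model call (diagram parsing).
    return f"Explanation of slide given context: {context!r}"

def text_summarise(explanation: str) -> str:
    # Placeholder for the text-model call (one-line summary).
    return explanation[:60]

def explain_deck(slides: list[bytes]) -> list[str]:
    explanations: list[str] = []
    context = ""
    for image in slides:
        explanation = vision_explain(image, context)
        explanations.append(explanation)
        # Carry a rolling summary forward so slide N+1 can refer back
        # to terms and figures introduced earlier in the deck.
        context = (context + " " + text_summarise(explanation)).strip()
    return explanations
```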
University-endorsed project directly supervised by Prof. Jiannong Cao (Vice President for Research & Innovation, PolyU) — among the highest-level institutional endorsements available to an undergraduate. Deployed for official PolyU use.
Four-service AI interview platform: self-hosted Qwen3 (no cloud LLM dependency) for multi-style adaptive questioning, Azure Neural TTS real-time streaming replacing offline EdgeTTS for lower latency and expressive prosody, Wav2Lip + WebRTC lip-synced avatar, and a 3-tier KSA/STAR evaluation framework.
US equity research terminal combining deterministic financial models (DCF, valuation ratios), SEC/IR data ingestion, and LLM-assisted narrative analysis. Generates structured research reports ranked by actionability.
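A minimal sketch of the deterministic DCF core, assuming a two-stage model with a Gordon-growth terminal value; the inputs in the example are illustrative, not the terminal's defaults.

```python
# Two-stage DCF sketch: the deterministic core the terminal combines
# with LLM-assisted narrative. Inputs below are illustrative.
def dcf_value(fcf: float, growth: float, terminal_growth: float,
              discount: float, years: int = 5) -> float:
    """Present value of projected free cash flows plus a
    Gordon-growth terminal value."""
    pv, cash = 0.0, fcf
    for t in range(1, years + 1):
        cash *= 1 + growth
        pv += cash / (1 + discount) ** t
    terminal = cash * (1 + terminal_growth) / (discount - terminal_growth)
    return pv + terminal / (1 + discount) ** years

# Example: $100M FCF, 8% growth fading to 2.5%, 10% discount rate.
print(f"Enterprise value: ${dcf_value(100.0, 0.08, 0.025, 0.10):,.1f}M")
```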
iOS/macOS diary app where you write what you're planning and GPT-4o converts it to calendar events. Handles relative time expressions ("next Tuesday afternoon"), offline-first state with Supabase, and conflict resolution on reconnect.
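A minimal sketch of the diary-to-event step using the OpenAI SDK; the prompt, JSON schema, and anchoring on the current timestamp are assumptions, and the real app layers offline queueing and Supabase sync on top.

```python
# Sketch of diary-entry-to-calendar-event conversion. Prompt and
# schema are illustrative assumptions, not the app's exact prompt.
import json
from datetime import datetime
from openai import OpenAI

client = OpenAI()

def entry_to_event(entry: str) -> dict:
    """Resolve relative expressions like 'next Tuesday afternoon'
    by anchoring the model on the current timestamp."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content":
             f"Now is {datetime.now().isoformat()}. Convert the diary "
             "entry into a calendar event as JSON with keys: "
             "title, start (ISO 8601), end (ISO 8601)."},
            {"role": "user", "content": entry},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# Example: entry_to_event("Coffee with Sam next Tuesday afternoon")
```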
Tracks and curates the latest papers from top CS venues via arXiv API. Intelligent filtering, topic categorisation, and weekly digest delivery. Useful for staying current across ACL, NeurIPS, ICLR, and CVPR tracks.
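A minimal sketch of the fetch-and-filter step using the arxiv Python package; the category query and keyword list are illustrative assumptions.

```python
# Sketch of venue-adjacent paper tracking via the arXiv API.
# Categories and keywords are illustrative assumptions.
import arxiv

client = arxiv.Client()
search = arxiv.Search(
    query="cat:cs.CL OR cat:cs.LG OR cat:cs.CV",
    max_results=50,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

keywords = {"benchmark", "multimodal", "vision-language"}
for paper in client.results(search):
    text = (paper.title + " " + paper.summary).lower()
    if any(kw in text for kw in keywords):   # crude topic filter
        print(paper.published.date(), paper.title)
```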
Full-stack Gomoku with a Minimax + Alpha-Beta pruning engine and real-time D3.js game-tree visualisation over WebSocket. Designed as both a playable game and a visual demonstration of adversarial search.
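A minimal alpha-beta sketch in the shape the engine uses, written against an assumed GameState interface rather than the real board code.

```python
# Minimax with alpha-beta pruning. The GameState protocol is an
# assumed stand-in for the engine's actual Gomoku board logic.
import math
from typing import Protocol

class GameState(Protocol):
    def is_terminal(self) -> bool: ...
    def evaluate(self) -> float: ...
    def legal_moves(self) -> list: ...
    def apply(self, move) -> "GameState": ...

def alphabeta(state: GameState, depth: int, alpha: float, beta: float,
              maximizing: bool) -> float:
    """Cutoffs here are what the D3.js view renders as pruned subtrees."""
    if depth == 0 or state.is_terminal():
        return state.evaluate()
    if maximizing:
        best = -math.inf
        for move in state.legal_moves():
            best = max(best, alphabeta(state.apply(move), depth - 1,
                                       alpha, beta, False))
            alpha = max(alpha, best)
            if alpha >= beta:   # opponent already has a better option: prune
                break
        return best
    best = math.inf
    for move in state.legal_moves():
        best = min(best, alphabeta(state.apply(move), depth - 1,
                                   alpha, beta, True))
        beta = min(beta, best)
        if alpha >= beta:
            break
    return best
```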
Multimodal AI researcher at PolyU · Full-stack engineer · drop me an email or find me on GitHub.