Multimodal AI Researcher  ·  PolyU URIS

Haochen Shi

石昊宸

Multimodal AI Researcher at PolyU URIS · Full-Stack Engineer · 2 papers under review at ACM MM 2026 & KDD 2026

3.85
GPA
2
Papers Under Review
URIS
Funded Research
12+
AI Projects
Download CV (EN) · Download CV (中文) · View Research · haochen8.shi@connect.polyu.hk
ACM MM 2026 CCD-VQA Traffic Understanding Benchmark
KDD 2026 SurveyLens Discipline-Aware ASG Benchmark
URIS PolyU Undergraduate Research & Innovation Scheme
Multimodal AI VLM Systems · Home Environment Dialogue

Building multimodal AI systems where research meets production

I'm a CS undergraduate at The Hong Kong Polytechnic University (GPA 3.85), selected for the Undergraduate Research & Innovation Scheme (URIS). My work investigates layered VLM architectures — specifically, whether keeping detection and tracking out of the reasoning loop improves latency, interpretability, and hallucination rate in multi-turn dialogue.

Alongside the URIS project, I've contributed to two conference submissions: a traffic-scene VQA benchmark (ACM MM 2026) and a cross-discipline survey generation benchmark (KDD 2026). I also build full-stack AI applications in production — deployed, used, and iterated.

CS @ PolyU · GPA 3.85 · URIS Research Scheme
Full-stack engineer with production deployments — Next.js, FastAPI, SQLite, Cloudflare Tunnel
Current Research

Multimodal Home Environment Interaction

Layered VLM system — YOLOv8 + ByteTrack + Qwen2.5-VL on-demand. Investigating latency vs. interpretability trade-offs over multi-turn dialogue.

Conference Submissions

ACM MM 2026 & KDD 2026

CCD-VQA (traffic understanding, 1,194 QA pairs) and SurveyLens (ASG benchmark, 10 disciplines). Both under review.

Academic Work

CCD-VQA: Benchmark Construction and Evaluation for Traffic Circumstances Understanding in Multimodal Large Language Models

Under Review
ACM International Conference on Multimedia (ACM MM 2026) · Zhou Letian, Haochen Shi, Wei Lou

A VQA benchmark targeting traffic accident understanding — an area where current VLMs show a systematic reasoning failure. 199 accident videos filtered from 1,500 via YOLOv8 kinematic analysis; 1,194 QA pairs across six dimensions: Weather/Light, Traffic Environment, Road Configuration, Accident Type, Accident Cause, and Accident Prevention. Sentence-BERT controls distractor quality; a Benchmark Suitability Score (BSS) grounded in Item Response Theory suppresses random-guessing shortcuts.

199 accident videos
1,194 QA pairs · 6 dimensions
Gemini: 77.55%
Open-source models: ~64%
Causal reasoning gap: −20–25%

SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation

Under Review
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026) · Beichen Guo, Zhiyuan Wen, Gu Jia, Senzhang Wang, Ruosong Yang, Shuaiqi Liu, Haochen Shi

Existing ASG benchmarks are CS-biased and rely on generic metrics. SurveyLens-1k covers 10 disciplines (Biology, Business, CS, Education, Engineering, Environmental Science, Medicine, Physics, Psychology, Sociology) with 100 human-written surveys each. Evaluation combines discipline-aware rubric scoring (LLM-as-judge with Bradley-Terry preference-aligned weights) and canonical alignment metrics (RAMS + TAMS). Key finding: Deep Research agents produce richer narratives but lose structural precision, and vanilla LLMs outperform specialised ASG systems in the humanities.

10 disciplines · 100 surveys each
11 methods evaluated
RAMS + TAMS alignment metrics
Bradley-Terry rubric scoring

Multimodal LLM-Based Home Environment Interaction System

URIS Funded · PolyU
Undergraduate Research & Innovation Scheme · Sole Investigator
Highly selective grant: a university-wide competitive scheme with ~100 students selected across all departments; among these, only 5–6 CS students were matched with a CS faculty supervisor. Official approved project list

Investigates whether a layered detector-tracker-VLM pipeline (YOLOv8 + pluggable tracker + Qwen2.5-VL on-demand) achieves lower latency and higher interpretability than end-to-end VLM querying. Object Registry maintains temporal event logs; Reference Resolver handles spatial and ordinal disambiguation. Evaluation Lab produces structured JSON metrics: json_valid_rate, clarification_rate, reference_resolution_rate, cache_hit_rate.

YOLOv8 + ByteTrack perception layer
Qwen2.5-VL on-demand reasoning
Prompt caching · Fast Mode
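The Evaluation Lab's metric names come from the project itself; everything else below is a minimal sketch of how such per-turn logs could be aggregated, assuming each turn is a dict with illustrative fields (`reply`, `needs_clarification`, `reference_resolved`, `cache_hit`) that are not the project's real schema.

```python
import json

def evaluation_metrics(turns):
    """Aggregate per-turn dialogue logs into the Evaluation Lab's
    structured metrics. Field names on each turn dict are
    illustrative assumptions, not the project's real schema."""
    n = len(turns)

    def rate(pred):
        return sum(1 for t in turns if pred(t)) / n if n else 0.0

    def is_valid_json(text):
        try:
            json.loads(text)
            return True
        except (TypeError, ValueError):
            return False

    return {
        "json_valid_rate": rate(lambda t: is_valid_json(t.get("reply"))),
        "clarification_rate": rate(lambda t: t.get("needs_clarification", False)),
        "reference_resolution_rate": rate(lambda t: t.get("reference_resolved", False)),
        "cache_hit_rate": rate(lambda t: t.get("cache_hit", False)),
    }
```

Emitting the result as structured JSON keeps runs comparable across pipeline configurations (e.g. Fast Mode on vs. off).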

Tools I Work With

Languages, frameworks, and infrastructure used across research and production.

Languages
Python
TypeScript
Java
C++
Go
SQL
AI / ML
PyTorch
PEFT / LoRA
YOLOv8
OpenCV
LangChain
Full-Stack
React
Next.js
FastAPI
Flask
Spring Boot
Docker
SQLite
PostgreSQL

Selected Projects

Production deployments, research tooling, and AI systems.

// EdTech Platform

Teaching-Learning

AI slide lecture assistant — upload any PPT, get page-by-page explanations in a split-screen view. Two Qwen models run in parallel: vision for diagram parsing, text for cross-slide narrative. Prior-slide summaries passed forward as context to maintain coherence.

Dual-pipeline Qwen-VL + Qwen-Chat · WAL-mode SQLite caching eliminates redundant inference on long decks
Next.js 15 · FastAPI · Qwen-VL · SQLite WAL · Cloudflare Tunnel
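The forward-passing of prior-slide summaries can be sketched as a simple fold over the deck. Here `explain_slide` and `summarize` stand in for the Qwen-VL and Qwen-Chat calls; both names are assumptions for illustration, not the project's real API.

```python
def explain_deck(slides, explain_slide, summarize):
    """Walk a deck slide by slide, threading a running summary of the
    slides already covered so each explanation stays coherent with the
    narrative so far. The two callables abstract the model endpoints."""
    context = ""            # rolling summary of prior slides
    explanations = []
    for slide in slides:
        text = explain_slide(slide, context)   # explain with prior context
        explanations.append(text)
        context = summarize(context, text)     # fold the new explanation in
    return explanations
```

In production the summary would also be capped to fit the model's context window, and the WAL-mode SQLite cache would be consulted before each `explain_slide` call.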
★ KEY PROJECT
// University-Level Initiative

PolyInterview · AI Mock Interview Platform

University-endorsed project directly supervised by Prof. Jiannong Cao (Vice President for Research & Innovation, PolyU) — among the highest-level institutional endorsements available to an undergraduate. Deployed for official PolyU use.

Four-service AI interview platform: self-hosted Qwen3 (no cloud LLM dependency) for multi-style adaptive questioning, Azure Neural TTS real-time streaming replacing offline EdgeTTS for lower latency and expressive prosody, Wav2Lip + WebRTC lip-synced avatar, and a 3-tier KSA/STAR evaluation framework.

Azure Neural TTS (streaming) → Wav2Lip → WebRTC — replaced EdgeTTS to cut latency and enable emotional expression for a natural interviewer persona
Qwen3 (self-hosted) · Azure Neural TTS · Wav2Lip · WebRTC · Vue.js · Flask · Docker
// FinTech AI

AI Investment Terminal

US equity research terminal combining deterministic financial models (DCF, valuation ratios), SEC/IR data ingestion, and LLM-assisted narrative analysis. Generates structured research reports ranked by actionability.

Deterministic financial models + LLM narrative layer — avoids hallucinated numbers by grounding in structured data first
Python · Pandas · Financial APIs · AI/ML
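A deterministic valuation core is what keeps the LLM layer from inventing numbers. As a minimal sketch (not the terminal's actual model), a two-stage DCF with a Gordon-growth terminal value looks like this; all inputs are illustrative:

```python
def dcf_value(free_cash_flows, discount_rate, terminal_growth):
    """Two-stage discounted cash flow: discount each projected FCF,
    then add a Gordon-growth terminal value discounted back from the
    final explicit year. Requires terminal_growth < discount_rate."""
    if terminal_growth >= discount_rate:
        raise ValueError("terminal growth must be below the discount rate")
    pv_explicit = sum(
        fcf / (1 + discount_rate) ** t
        for t, fcf in enumerate(free_cash_flows, start=1)
    )
    last = free_cash_flows[-1]
    terminal = last * (1 + terminal_growth) / (discount_rate - terminal_growth)
    pv_terminal = terminal / (1 + discount_rate) ** len(free_cash_flows)
    return pv_explicit + pv_terminal
```

The LLM then narrates around these computed figures instead of generating them, which is the grounding step the project description refers to.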
// Mobile Experience

SmartJournal

iOS/macOS diary app: write free-form plans and GPT-4o converts them into calendar events. Handles relative time expressions ("next Tuesday afternoon"), offline-first state with Supabase, and conflict resolution on reconnect.

Local-first state management — syncs with Supabase on reconnect with deterministic conflict resolution
React Native · Expo · TypeScript · Supabase · GPT-4o
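Deterministic conflict resolution on reconnect can be as simple as last-writer-wins per event, with a fixed tie-break so every device converges to the same state. This is a sketch under assumed field names (`id`, `updated_at`), shown in Python for readability rather than the app's TypeScript:

```python
def merge_events(local, remote):
    """Merge local and remote event lists: for each event id, keep the
    version with the newer `updated_at`; ties break toward the remote
    copy so all devices converge deterministically."""
    merged = {e["id"]: e for e in local}
    for event in remote:
        current = merged.get(event["id"])
        if current is None or event["updated_at"] >= current["updated_at"]:
            merged[event["id"]] = event
    return sorted(merged.values(), key=lambda e: e["id"])
```

Because the merge depends only on the two input lists, replaying it after any partition yields the same calendar on every device.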
// Research Automation

PaperHunter

Tracks and curates the latest papers from top CS venues via arXiv API. Intelligent filtering, topic categorisation, and weekly digest delivery. Useful for staying current across ACL, NeurIPS, ICLR, and CVPR tracks.

Automated curation pipeline — keyword + embedding-based relevance filtering with digest delivery
Python · arXiv API · Automation
// Game AI

Graph Gomoku AI

Full-stack Gomoku with a Minimax + Alpha-Beta pruning engine and real-time D3.js game-tree visualisation over WebSocket. Designed as both a playable game and a visual demonstration of adversarial search.

Live game-tree rendering via D3.js + WebSocket — visualises Alpha-Beta pruning in real time
Java · Spring Boot · React · TypeScript · D3.js
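The pruning the visualisation renders is the standard alpha-beta cutoff. A minimal sketch over a toy game tree (leaves are static evaluations, internal nodes are lists of children); the real engine searches Gomoku positions in Java with a heuristic evaluation, so this Python version only shows the search logic:

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax with alpha-beta pruning over a toy tree: numbers are
    leaf evaluations, lists are internal nodes whose children
    alternate between maximizing and minimizing players."""
    if isinstance(node, (int, float)):       # leaf: static evaluation
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                # beta cutoff: opponent avoids this branch
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:                    # alpha cutoff
            break
    return value
```

Each cutoff corresponds to a subtree the D3.js view can grey out, which is what makes the pruning visible as the engine thinks.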

Let's Connect

Multimodal AI researcher at PolyU · Full-stack engineer · Drop me an email or find me on GitHub.