Multimodal AI Researcher  ·  PolyU URIS

Haochen Shi

石昊宸

Multimodal AI Researcher at PolyU URIS · Full-Stack Engineer · 2 papers under review at ACM MM 2026 & KDD 2026

3.85
GPA
2
Papers Under Review
URIS
Funded Research
12+
AI Projects
Download CV (EN) · Download CV (中文) · View Research · haochen8.shi@connect.polyu.hk
ACM MM 2026 CCD-VQA Traffic Understanding Benchmark
KDD 2026 SurveyLens Discipline-Aware ASG Benchmark
URIS PolyU Undergraduate Research & Innovation Scheme
Multimodal AI VLM Systems · Home Environment Dialogue

Building multimodal AI systems where research meets production

I'm a CS undergraduate at The Hong Kong Polytechnic University (GPA 3.85), selected for the Undergraduate Research & Innovation Scheme (URIS). My work investigates layered VLM architectures — specifically, whether keeping detection and tracking out of the reasoning loop improves latency, interpretability, and hallucination rate in multi-turn dialogue.

Alongside the URIS project, I've contributed to two conference submissions: a traffic-scene VQA benchmark (ACM MM 2026) and a cross-discipline survey generation benchmark (KDD 2026). I also build full-stack AI applications in production — deployed, used, and iterated.

CS @ PolyU · GPA 3.85 · URIS Research Scheme
Full-stack engineer with production deployments — Next.js, FastAPI, SQLite, Cloudflare Tunnel
Current Research

Multimodal Home Environment Interaction

Layered VLM system — YOLOv8 + ByteTrack + Qwen2.5-VL on-demand. Investigating latency vs. interpretability trade-offs over multi-turn dialogue.

Conference Submissions

ACM MM 2026 & KDD 2026

CCD-VQA (traffic understanding, 1,194 QA pairs) and SurveyLens (ASG benchmark, 10 disciplines). Both under review.

Academic Work

CCD-VQA: Benchmark Construction and Evaluation for Traffic Circumstances Understanding in Multimodal Large Language Models

Under Review
ACM International Conference on Multimedia (ACM MM 2026) · Zhou Letian, Haochen Shi, Wei Lou

A VQA benchmark targeting traffic accident understanding — an area where current VLMs show a systematic reasoning failure. 199 accident videos filtered from 1,500 via YOLOv8 kinematic analysis; 1,194 QA pairs across six dimensions: Weather/Light, Traffic Environment, Road Configuration, Accident Type, Accident Cause, and Accident Prevention. Sentence-BERT controls distractor quality; a Benchmark Suitability Score (BSS) grounded in Item Response Theory suppresses random-guessing shortcuts.

199 accident videos
1,194 QA pairs · 6 dimensions
Gemini: 77.55%
Open-source models: ~64%
Causal reasoning gap: −20–25%

SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation

Under Review
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026) · Beichen Guo, Zhiyuan Wen, Gu Jia, Senzhang Wang, Ruosong Yang, Shuaiqi Liu, Haochen Shi

Existing ASG benchmarks are CS-biased and rely on generic metrics. SurveyLens-1k covers 10 disciplines (Biology, Business, CS, Education, Engineering, Environmental Science, Medicine, Physics, Psychology, Sociology) with 100 human-written surveys each. Evaluation combines discipline-aware rubric scoring (LLM-as-judge with Bradley-Terry preference-aligned weights) and canonical alignment metrics (RAMS + TAMS). Key finding: Deep Research agents produce richer narratives but lose structural precision, and vanilla LLMs outperform specialised ASG systems in the humanities.

10 disciplines · 100 surveys each
11 methods evaluated
RAMS + TAMS alignment metrics
Bradley-Terry rubric scoring

Multimodal LLM-Based Home Environment Interaction System

URIS Funded · PolyU
Undergraduate Research & Innovation Scheme · Sole Investigator
Highly selective grant: a university-wide competitive scheme with ~100 students selected across all departments; among these, only 5–6 CS students were matched with a CS faculty supervisor. Official approved project list

Investigates whether a layered detector-tracker-VLM pipeline (YOLOv8 + pluggable tracker + Qwen2.5-VL on-demand) achieves lower latency and higher interpretability than end-to-end VLM querying. Object Registry maintains temporal event logs; Reference Resolver handles spatial and ordinal disambiguation. Evaluation Lab produces structured JSON metrics: json_valid_rate, clarification_rate, reference_resolution_rate, cache_hit_rate.

YOLOv8 + ByteTrack perception layer
Qwen2.5-VL on-demand reasoning
Prompt caching · Fast Mode
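The Evaluation Lab's metric names come from the project itself; everything else below is a minimal sketch of how such per-turn logs could be aggregated, assuming each turn is a dict with illustrative fields (`reply`, `needs_clarification`, `reference_resolved`, `cache_hit`) that are not the project's real schema.

```python
import json

def evaluation_metrics(turns):
    """Aggregate per-turn dialogue logs into the Evaluation Lab's
    structured metrics. Field names on each turn dict are
    illustrative assumptions, not the project's real schema."""
    n = len(turns)

    def rate(pred):
        return sum(1 for t in turns if pred(t)) / n if n else 0.0

    def is_valid_json(text):
        try:
            json.loads(text)
            return True
        except (TypeError, ValueError):
            return False

    return {
        "json_valid_rate": rate(lambda t: is_valid_json(t.get("reply"))),
        "clarification_rate": rate(lambda t: t.get("needs_clarification", False)),
        "reference_resolution_rate": rate(lambda t: t.get("reference_resolved", False)),
        "cache_hit_rate": rate(lambda t: t.get("cache_hit", False)),
    }
```

Emitting the result as structured JSON keeps runs comparable across pipeline configurations (e.g. Fast Mode on vs. off).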

Tools I Work With

Languages, frameworks, and infrastructure used across research and production.

Languages
Python
TypeScript
Java
C++
Go
SQL
AI / ML
PyTorch
PEFT / LoRA
YOLOv8
OpenCV
LangChain
Full-Stack
React
Next.js
FastAPI
Flask
Spring Boot
Docker
SQLite
PostgreSQL

Selected Projects

Production deployments, research tooling, and AI systems.

// EdTech Platform

Teaching-Learning

AI slide lecture assistant — upload any PPT, get page-by-page explanations in a split-screen view. Two Qwen models run in parallel: vision for diagram parsing, text for cross-slide narrative. Prior-slide summaries passed forward as context to maintain coherence.

Dual-pipeline Qwen-VL + Qwen-Chat · WAL-mode SQLite caching eliminates redundant inference on long decks
Next.js 15 · FastAPI · Qwen-VL · SQLite WAL · Cloudflare Tunnel
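The forward-passing of prior-slide summaries can be sketched as a simple fold over the deck. Here `explain_slide` and `summarize` stand in for the Qwen-VL and Qwen-Chat calls; both names are assumptions for illustration, not the project's real API.

```python
def explain_deck(slides, explain_slide, summarize):
    """Walk a deck slide by slide, threading a running summary of the
    slides already covered so each explanation stays coherent with the
    narrative so far. The two callables abstract the model endpoints."""
    context = ""            # rolling summary of prior slides
    explanations = []
    for slide in slides:
        text = explain_slide(slide, context)   # explain with prior context
        explanations.append(text)
        context = summarize(context, text)     # fold the new explanation in
    return explanations
```

In production the summary would also be capped to fit the model's context window, and the WAL-mode SQLite cache would be consulted before each `explain_slide` call.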
★ KEY PROJECT
// University-Level Initiative

PolyInterview · AI Mock Interview Platform

University-endorsed project directly supervised by Prof. Jiannong Cao (Vice President for Research & Innovation, PolyU) — among the highest-level institutional endorsements available to an undergraduate. Deployed for official PolyU use.

Four-service AI interview platform: self-hosted Qwen3 (no cloud LLM dependency) for multi-style adaptive questioning, Azure Neural TTS real-time streaming replacing offline EdgeTTS for lower latency and expressive prosody, Wav2Lip + WebRTC lip-synced avatar, and a 3-tier KSA/STAR evaluation framework.

Azure Neural TTS (streaming) → Wav2Lip → WebRTC — replaced EdgeTTS to cut latency and enable emotional expression for a natural interviewer persona
Qwen3 (self-hosted) · Azure Neural TTS · Wav2Lip · WebRTC · Vue.js · Flask · Docker
// FinTech AI

AI Investment Terminal

US equity research terminal combining deterministic financial models (DCF, valuation ratios), SEC/IR data ingestion, and LLM-assisted narrative analysis. Generates structured research reports ranked by actionability.

Deterministic financial models + LLM narrative layer — avoids hallucinated numbers by grounding in structured data first
Python · Pandas · Financial APIs · AI/ML
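A deterministic valuation core is what keeps the LLM layer from inventing numbers. As a minimal sketch (not the terminal's actual model), a two-stage DCF with a Gordon-growth terminal value looks like this; all inputs are illustrative:

```python
def dcf_value(free_cash_flows, discount_rate, terminal_growth):
    """Two-stage discounted cash flow: discount each projected FCF,
    then add a Gordon-growth terminal value discounted back from the
    final explicit year. Requires terminal_growth < discount_rate."""
    if terminal_growth >= discount_rate:
        raise ValueError("terminal growth must be below the discount rate")
    pv_explicit = sum(
        fcf / (1 + discount_rate) ** t
        for t, fcf in enumerate(free_cash_flows, start=1)
    )
    last = free_cash_flows[-1]
    terminal = last * (1 + terminal_growth) / (discount_rate - terminal_growth)
    pv_terminal = terminal / (1 + discount_rate) ** len(free_cash_flows)
    return pv_explicit + pv_terminal
```

The LLM then narrates around these computed figures instead of generating them, which is the grounding step the project description refers to.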
// Mobile Experience

SmartJournal

iOS/macOS diary app: write free-form plans and GPT-4o converts them into calendar events. Handles relative time expressions ("next Tuesday afternoon"), offline-first state with Supabase, and conflict resolution on reconnect.

Local-first state management — syncs with Supabase on reconnect with deterministic conflict resolution
React Native · Expo · TypeScript · Supabase · GPT-4o
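Deterministic conflict resolution on reconnect can be as simple as last-writer-wins per event, with a fixed tie-break so every device converges to the same state. This is a sketch under assumed field names (`id`, `updated_at`), shown in Python for readability rather than the app's TypeScript:

```python
def merge_events(local, remote):
    """Merge local and remote event lists: for each event id, keep the
    version with the newer `updated_at`; ties break toward the remote
    copy so all devices converge deterministically."""
    merged = {e["id"]: e for e in local}
    for event in remote:
        current = merged.get(event["id"])
        if current is None or event["updated_at"] >= current["updated_at"]:
            merged[event["id"]] = event
    return sorted(merged.values(), key=lambda e: e["id"])
```

Because the merge depends only on the two input lists, replaying it after any partition yields the same calendar on every device.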
// Research Automation

PaperHunter

Tracks and curates the latest papers from top CS venues via arXiv API. Intelligent filtering, topic categorisation, and weekly digest delivery. Useful for staying current across ACL, NeurIPS, ICLR, and CVPR tracks.

Automated curation pipeline — keyword + embedding-based relevance filtering with digest delivery
Python · arXiv API · Automation
// Game AI

Graph Gomoku AI

Full-stack Gomoku with a Minimax + Alpha-Beta pruning engine and real-time D3.js game-tree visualisation over WebSocket. Designed as both a playable game and a visual demonstration of adversarial search.

Live game-tree rendering via D3.js + WebSocket — visualises Alpha-Beta pruning in real time
Java · Spring Boot · React · TypeScript · D3.js
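The pruning the visualisation renders is the standard alpha-beta cutoff. A minimal sketch over a toy game tree (leaves are static evaluations, internal nodes are lists of children); the real engine searches Gomoku positions in Java with a heuristic evaluation, so this Python version only shows the search logic:

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax with alpha-beta pruning over a toy tree: numbers are
    leaf evaluations, lists are internal nodes whose children
    alternate between maximizing and minimizing players."""
    if isinstance(node, (int, float)):       # leaf: static evaluation
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                # beta cutoff: opponent avoids this branch
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:                    # alpha cutoff
            break
    return value
```

Each cutoff corresponds to a subtree the D3.js view can grey out, which is what makes the pruning visible as the engine thinks.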

Let's Connect

Multimodal AI researcher at PolyU · Full-stack engineer · Drop me an email or find me on GitHub.