🔬 Research Interests

📚 Selected Research Projects

Mini-Onevision: Lightweight Multimodal Agent Framework with GRPO Feb 2026 - Present

Goal: Build a lightweight Multimodal Large Language Model (MLLM) with visual-agentic capabilities. The project integrates Qwen3-0.5B (language base) with SigLIP (vision encoder) and applies Group Relative Policy Optimization (GRPO), the reinforcement-learning algorithm used in DeepSeek-R1, to strengthen logical reasoning and tool use in low-compute environments.

Methodology:

  • Phase I: Architecture & Alignment. Use SigLIP as the vision encoder and Qwen3-0.5B as the LLM backbone; design and train a projection layer that maps visual features into the language embedding space, pre-trained on LLaVA Bench datasets for basic image understanding.
  • Phase II: SFT Cold Start. Synthesize high-quality tool-calling datasets (single-turn and multi-turn) and run an SFT cold start so the model learns basic agent-interaction formats (JSON output, API selection).
  • Phase III: Agentic RL with GRPO. Build a lightweight RL framework modeled on Slime/Verl, and apply the GRPO algorithm together with RAG and search tools, using retrieval accuracy and task completion as reward signals.
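The Phase I alignment step can be illustrated with a LLaVA-style projector: a small MLP that maps vision-encoder features into the LLM embedding space. The sketch below uses NumPy; all dimensions, the ReLU nonlinearity, and the initialization are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the project's real config):
VIS_DIM = 768     # SigLIP patch-feature dimension
LLM_DIM = 1024    # language-model embedding dimension
N_PATCH = 196     # number of image patches

# Two-layer MLP projector: vision feature space -> language embedding space.
W1 = rng.normal(0, 0.02, (VIS_DIM, LLM_DIM))
b1 = np.zeros(LLM_DIM)
W2 = rng.normal(0, 0.02, (LLM_DIM, LLM_DIM))
b2 = np.zeros(LLM_DIM)

def project(vision_feats: np.ndarray) -> np.ndarray:
    """Map [n_patches, VIS_DIM] SigLIP features to [n_patches, LLM_DIM] tokens."""
    h = np.maximum(vision_feats @ W1 + b1, 0.0)  # GELU in practice; ReLU here
    return h @ W2 + b2

patches = rng.normal(size=(N_PATCH, VIS_DIM))
visual_tokens = project(patches)
print(visual_tokens.shape)  # (196, 1024)
```

During alignment training only the projector's weights would typically receive gradients, with the vision encoder and LLM kept frozen, which is what keeps this phase cheap.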

Key Advantages:

  • Low-Cost Training: The lightweight Qwen3-0.5B + SigLIP combination avoids expensive pre-training.
  • Reproducibility: Qwen3's pre-training already includes rich agentic data, which supports strong zero-shot tool use.
  • GRPO RL Paradigm: Replicates GRPO on a small multimodal model (group sampling keeps VRAM usage low) to elicit chain-of-thought generation and deeper reasoning.
  • Visual Agentic Loop: Establishes a complete Perception -> Planning -> Action chain.
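The low-VRAM property comes from GRPO replacing a learned value network with within-group reward normalization: each sampled response's advantage is its reward z-scored against the other responses for the same prompt. A minimal NumPy sketch of that advantage computation (function names and reward values are illustrative):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: z-score each rollout's reward within its group.

    rewards: [n_groups, group_size] scalar rewards, one per sampled response.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# One prompt, a group of 4 sampled responses with task-completion rewards:
rewards = np.array([[1.0, 0.0, 0.0, 1.0]])
adv = grpo_advantages(rewards)
print(adv)  # correct responses get positive advantage, wrong ones negative
```

Because the baseline is computed from the group itself, no critic model has to be held in memory alongside the policy, which is what makes the paradigm practical on small GPUs.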

Self-Powered Multimodal Emotion Recognition System Dec 2023 - Aug 2024

Lead Researcher | Advisor: Prof. Ding Zheng (UESTC)

Project Introduction: Developed a wearable device integrating voice and EEG signal analysis for emotion detection.

Personal Contributions:

  • Built the model architecture: designed a bimodal binding mechanism for voice signals that combines modality-invariant and modality-specific features before feeding them into a Transformer, and a wavelet-transform-based preprocessing step that converts EEG signals into time-frequency image blocks for a Vision Transformer.
  • Fabricated flexible organic photovoltaic devices with PM6:Y6 (1:1.2) as the organic active layer and ITO-PET as the flexible cathode, reaching a power conversion efficiency above 12.39%.
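The wavelet-based EEG preprocessing described above can be sketched as follows: a complex-Morlet continuous wavelet transform turns a 1-D signal into a time-frequency map, which is then cut into square blocks for a Vision Transformer. This is a NumPy sketch; the sampling rate, frequency grid, cycle count, and patch size are assumptions, not the project's actual parameters.

```python
import numpy as np

def morlet_scalogram(signal: np.ndarray, fs: float, freqs: np.ndarray) -> np.ndarray:
    """Continuous wavelet transform with a complex Morlet wavelet.

    Returns a [len(freqs), len(signal)] time-frequency magnitude map.
    """
    n_cycles = 5.0
    out = np.empty((len(freqs), len(signal)))
    for i, f in enumerate(freqs):
        sigma_t = n_cycles / (2 * np.pi * f)          # wavelet width in seconds
        t = np.arange(-3 * sigma_t, 3 * sigma_t, 1 / fs)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2 * sigma_t**2))
        wavelet /= np.abs(wavelet).sum()
        out[i] = np.abs(np.convolve(signal, wavelet, mode="same"))
    return out

def patchify(img: np.ndarray, patch: int) -> np.ndarray:
    """Cut a [H, W] map into non-overlapping [patch, patch] blocks for a ViT."""
    h, w = img.shape[0] // patch, img.shape[1] // patch
    img = img[: h * patch, : w * patch]
    return img.reshape(h, patch, w, patch).swapaxes(1, 2).reshape(-1, patch, patch)

fs = 128.0                                # assumed EEG sampling rate
t = np.arange(0, 2.0, 1 / fs)             # 2 s of synthetic data
eeg = np.sin(2 * np.pi * 10 * t)          # 10 Hz alpha-band test signal
freqs = np.linspace(4, 40, 32)            # theta-to-gamma frequency range
tf_map = morlet_scalogram(eeg, fs, freqs)
blocks = patchify(tf_map, patch=16)
print(tf_map.shape, blocks.shape)  # (32, 256) (32, 16, 16)
```

Each block then plays the role of an image patch in the ViT's tokenizer, so EEG and voice streams can share a Transformer-style backbone.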

Achievements:

  • First-author paper accepted at ICDT 2025 (Oral Presentation)
  • National First Prize in IoT Design Competition 2024
  • National Innovation Training Program (Excellent Rating)

Superionic Conductor Materials Screening & Phase Prediction Jan 2025 - May 2025

Lead Researcher | Advisor: Prof. Hong Zhu (SJTU)

Project Introduction: Fine-tuned the general-purpose machine-learning potential MatterSim on a small amount of Li-N-S superionic-conductor data, then used the fine-tuned model to screen structures produced by the generative model MatterGen for stable materials in the Li-N-S system.

Personal Contributions — Built an active-learning loop that iteratively updates the model and its training data:

  • Trained and tested MatterSim on a small amount of labelled superionic-conductor data.
  • Generated unlabelled candidate superionic-conductor structures with MatterGen.
  • Used MatterSim as the structure prediction model and estimated its uncertainty with the Query-by-Committee method.
  • Ran DFT calculations on the highest-uncertainty structures and added them to the training set for the next round.
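The query-by-committee step of the loop above can be sketched as follows: an ensemble of fine-tuned potentials scores each generated structure, and the structures with the largest committee disagreement are sent to DFT for labelling. This is a toy NumPy sketch with synthetic numbers; the real loop would call MatterSim/MatterGen rather than a random-number generator.

```python
import numpy as np

rng = np.random.default_rng(42)

def committee_uncertainty(predictions: np.ndarray) -> np.ndarray:
    """Query-by-committee disagreement: std of the members' predictions.

    predictions: [n_members, n_structures] predicted energies per structure.
    """
    return predictions.std(axis=0)

def select_for_dft(predictions: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k structures the committee disagrees on most."""
    return np.argsort(committee_uncertainty(predictions))[::-1][:k]

# Toy stand-in for an ensemble of fine-tuned potentials scoring 100
# generated candidate structures (all numbers are synthetic):
n_members, n_structs = 4, 100
preds = rng.normal(0.0, 0.05, (n_members, n_structs))
# Make the first five structures genuinely ambiguous: each committee
# member predicts a clearly different energy for them.
preds[:, :5] += np.arange(n_members)[:, None] * 1.0

picked = select_for_dft(preds, k=5)
print(sorted(picked))  # the five high-disagreement structures
```

Labelling only the high-disagreement structures with DFT concentrates the expensive calculations where the current model is least reliable, which is what lets a small dataset drive each retraining round.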

Achievement: Selected for SJTU-Global College Research Internship Program 2025.
