🔬 Research Interests
- Multimodal Data / Sensor Fusion
- AI for Science (AI4S)
📚 Selected Research Projects
Mini-Onevision: Lightweight Multimodal Agent Framework with GRPO Feb 2026 - Present
Goal: Build a lightweight Multimodal Large Language Model (MLLM) with visual-agentic capabilities. By pairing Qwen3-0.5B (language backbone) with SigLIP (vision encoder) and applying Group Relative Policy Optimization (GRPO), the reinforcement-learning algorithm popularized by DeepSeek-R1, the project aims to achieve solid logical reasoning and tool use in low-compute environments.
Methodology:
- Phase I: Architecture & Alignment. Use SigLIP as the vision encoder and Qwen3-0.5B as the LLM backbone; design and train a projection layer that maps visual features into the language embedding space, then pre-train on LLaVA Bench datasets for basic image understanding (a minimal projector sketch follows this list).
- Phase II: SFT Cold Start. Synthesize high-quality tool-calling datasets (single-turn and multi-turn) and run an SFT cold start so the model learns the basic agent interaction format: JSON output and API selection (an illustrative sample layout follows this list).
- Phase III: Agentic RL with GRPO. Build a lightweight RL framework referencing Slime/Verl; apply GRPO together with RAG and search tools, using retrieval accuracy and task completion as reward signals (an advantage-computation sketch follows this list).
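Phase I projector, as a minimal sketch. The two-layer MLP design and the dimensions are assumptions for illustration: SigLIP-so400m patch embeddings are 1152-d, and the LLM hidden size is taken as 1024 here; read both from the actual checkpoint configs. A common choice at this stage is to freeze the vision encoder and train only the projector.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps SigLIP patch embeddings into the LLM token-embedding space.

    Dims are illustrative: SigLIP-so400m emits 1152-d patches; the LLM
    hidden size is assumed to be 1024 (check the real configs).
    """
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_embeds)
```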
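Phase II sample layout, illustrative only; the field names and tool schema are hypothetical rather than the project's actual format:

```python
# Hypothetical multi-turn tool-calling SFT sample; all field names are illustrative.
sample = {
    "messages": [
        {"role": "user",
         "content": "<image> Who designed the building in this photo?"},
        {"role": "assistant",
         # Target behavior the cold start teaches: emit a well-formed JSON tool call.
         "content": '{"tool": "web_search", "arguments": {"query": "architect of the pictured building"}}'},
        {"role": "tool",
         "content": "Top result: the building is the Sydney Opera House, designed by Jorn Utzon."},
        {"role": "assistant",
         "content": "The building was designed by Jorn Utzon."},
    ]
}
```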
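The core of Phase III is GRPO's group-relative advantage: sample a group of responses per prompt, score each with the reward (here, retrieval accuracy or task completion), and standardize rewards within the group, which removes the need for a learned critic. A minimal sketch:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used in GRPO.

    rewards: (num_prompts, group_size) scalar rewards for sampled responses.
    Each response's advantage is its reward standardized within its own
    group, so no value function (critic) has to be trained.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
advantages = grpo_advantages(rewards)  # above-group-mean responses get positive advantage
```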
Key Advantages:
- Low-Cost Training: The lightweight Qwen3-0.5B + SigLIP combination avoids heavy pre-training costs.
- Reproducibility: Qwen3's pre-training corpus already contains rich agentic data, which supports strong zero-shot tool use.
- GRPO RL Paradigm: Replicates GRPO on a small multimodal model (group sampling keeps VRAM usage low) to elicit chain-of-thought generation and deeper reasoning.
- Visual Agentic Loop: Establishes a complete Perception -> Planning -> Action chain (a minimal loop sketch follows this list).
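The loop itself, as a minimal sketch; `model.generate`, `parse_tool_call`, and the `tools` mapping are hypothetical placeholders, not a real API:

```python
# Minimal Perception -> Planning -> Action loop; every function and the
# message schema here are hypothetical placeholders, not a real API.
def run_agent(image, question, model, tools, max_turns: int = 4):
    context = [{"role": "user", "content": question, "image": image}]  # perception
    reply = None
    for _ in range(max_turns):
        reply = model.generate(context)        # planning: reason over image + history
        context.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)          # expects the JSON format taught in SFT
        if call is None:                       # no tool requested -> final answer
            return reply
        result = tools[call["tool"]](**call["arguments"])    # action: run the tool
        context.append({"role": "tool", "content": result})  # new evidence to perceive
    return reply
```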
Self-Powered Multimodal Emotion Recognition System Dec 2023 - Aug 2024
Lead Researcher | Advisor: Prof. Ding Zheng (UESTC)
Project Introduction: Developed a wearable device integrating voice and EEG signal analysis for emotion detection.
Personal Contributions:
- Designed the model architecture: a bimodal binding mechanism processes voice signals, combining modality-invariant and modality-specific features before feeding them into a Transformer; a wavelet-transform-based preprocessing step converts EEG signals into time-frequency image patches for a Vision Transformer (a preprocessing sketch follows this list).
- Fabricated flexible organic photovoltaic devices using a PM6:Y6 = 1:1.2 blend as the organic active layer and ITO-PET as the flexible cathode, achieving a power conversion efficiency exceeding 12.39%.
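A sketch of the wavelet preprocessing, assuming PyWavelets' continuous wavelet transform with a Morlet wavelet; the wavelet choice, scale range, and sampling rate are illustrative assumptions, not necessarily the project's exact settings:

```python
import numpy as np
import pywt  # PyWavelets

def eeg_to_scalogram(signal: np.ndarray, fs: float = 250.0,
                     n_scales: int = 64) -> np.ndarray:
    """Turn one EEG channel into a time-frequency image for a ViT.

    Returns an (n_scales, len(signal)) magnitude scalogram; downstream it
    can be resized and cut into fixed-size patches like any other image.
    The wavelet ('morl'), scales, and fs are illustrative choices.
    """
    scales = np.arange(1, n_scales + 1)
    coeffs, _freqs = pywt.cwt(signal, scales, "morl", sampling_period=1.0 / fs)
    scalogram = np.abs(coeffs)
    # Min-max normalize so patch values behave like image pixels in [0, 1].
    return (scalogram - scalogram.min()) / (scalogram.max() - scalogram.min() + 1e-8)
```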
Achievements:
- First-author paper accepted at ICDT 2025 (Oral Presentation)
- National First Prize in IoT Design Competition 2024
- National Innovation Training Program (Excellent Rating)
Superionic Conductor Materials Screening & Phase Prediction Jan 2025 - May 2025
Lead Researcher | Advisor: Prof. Hong Zhu (SJTU)
Project Introduction: Fine-tune the general-purpose machine-learning potential MatterSim on a small dataset of superionic conductors in the Li-N-S system, then use the fine-tuned model to screen structures produced by the generative model MatterGen for stable Li-N-S materials.
Personal Contributions — Constructed an active learning loop that iteratively updates the model and the data (a sketch of the committee-selection step follows this list):
- Trained and tested MatterSim on the small labelled set of superionic conductor data.
- Generated unlabelled candidate superionic conductor structures with MatterGen.
- Loaded MatterSim as the property predictor and applied the Query by Committee method (an ensemble of independently trained models) to estimate its uncertainty on the generated structures.
- Performed DFT calculations on the highest-uncertainty structures and added them to the training data for the next round of training.
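The selection step, as a sketch: Query by Committee takes the spread of predictions across an ensemble of fine-tuned models as the uncertainty signal. `predict_energy` is a hypothetical method standing in for the actual MatterSim inference API:

```python
import numpy as np

def select_by_committee(structures, committee, budget: int = 50):
    """Rank generated structures by committee disagreement (QBC).

    committee: independently fine-tuned potential models, each assumed to
    expose predict_energy(structure) -> float (a hypothetical API).
    The most-disputed structures go to DFT, and the labelled results are
    appended to the training set for the next fine-tuning round.
    """
    disagreement = np.array([
        np.std([model.predict_energy(s) for model in committee])
        for s in structures
    ])
    order = np.argsort(disagreement)[::-1]  # most uncertain first
    return [structures[i] for i in order[:budget]]
```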
Achievement: Selected for SJTU-Global College Research Internship Program 2025.