🔬 Research Interests
- Multimodal Data / Sensor Fusion
- AI for Science (AI4S)
📚 Selected Research Projects
Mini-Onevision: Lightweight Multimodal Agent Framework with GRPO Feb 2026 - Present
Goal: Build a lightweight Multimodal Large Language Model (MLLM) with visual-agentic capabilities. By pairing Qwen3-0.5B (language backbone) with SigLIP (vision encoder) and applying Group Relative Policy Optimization (GRPO), the reinforcement-learning algorithm popularized by DeepSeek-R1, the project aims to achieve solid logical reasoning and tool use in low-compute environments.
Methodology:
- Phase I: Architecture & Alignment. Use SigLIP as the vision encoder and Qwen3-0.5B as the LLM backbone; design and train a projection layer that maps visual features into the language embedding space, then pre-train on LLaVA Bench datasets for basic image understanding (a minimal projector sketch follows this list).
- Phase II: SFT Cold Start. Synthesize high-quality tool-calling datasets (single-turn and multi-turn) and run an SFT cold start so the model learns the basic agent interaction format: JSON output and API selection (an illustrative sample layout follows this list).
- Phase III: Agentic RL with GRPO. Build a lightweight RL framework referencing Slime/Verl; apply GRPO together with RAG and search tools, using retrieval accuracy and task completion as reward signals (an advantage-computation sketch follows this list).
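Phase I projector, as a minimal sketch. The two-layer MLP design and the dimensions are assumptions for illustration: SigLIP-so400m patch embeddings are 1152-d, and the LLM hidden size is taken as 1024 here; read both from the actual checkpoint configs. A common choice at this stage is to freeze the vision encoder and train only the projector.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps SigLIP patch embeddings into the LLM token-embedding space.

    Dims are illustrative: SigLIP-so400m emits 1152-d patches; the LLM
    hidden size is assumed to be 1024 (check the real configs).
    """
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_embeds)
```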
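Phase II sample layout, illustrative only; the field names and tool schema are hypothetical rather than the project's actual format:

```python
# Hypothetical multi-turn tool-calling SFT sample; all field names are illustrative.
sample = {
    "messages": [
        {"role": "user",
         "content": "<image> Who designed the building in this photo?"},
        {"role": "assistant",
         # Target behavior the cold start teaches: emit a well-formed JSON tool call.
         "content": '{"tool": "web_search", "arguments": {"query": "architect of the pictured building"}}'},
        {"role": "tool",
         "content": "Top result: the building is the Sydney Opera House, designed by Jorn Utzon."},
        {"role": "assistant",
         "content": "The building was designed by Jorn Utzon."},
    ]
}
```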
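The core of Phase III is GRPO's group-relative advantage: sample a group of responses per prompt, score each with the reward (here, retrieval accuracy or task completion), and standardize rewards within the group, which removes the need for a learned critic. A minimal sketch:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used in GRPO.

    rewards: (num_prompts, group_size) scalar rewards for sampled responses.
    Each response's advantage is its reward standardized within its own
    group, so no value function (critic) has to be trained.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
advantages = grpo_advantages(rewards)  # above-group-mean responses get positive advantage
```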
Key Advantages:
- Low-Cost Training: The lightweight Qwen3-0.5B + SigLIP combination avoids heavy pre-training costs.
- Reproducibility: Qwen3's pre-training corpus already contains rich agentic data, which supports strong zero-shot tool use.
- GRPO RL Paradigm: Replicates GRPO on a small multimodal model (group sampling keeps VRAM usage low) to elicit chain-of-thought generation and deeper reasoning.
- Visual Agentic Loop: Establishes a complete Perception -> Planning -> Action chain (a minimal loop sketch follows this list).
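The loop itself, as a minimal sketch; `model.generate`, `parse_tool_call`, and the `tools` mapping are hypothetical placeholders, not a real API:

```python
# Minimal Perception -> Planning -> Action loop; every function and the
# message schema here are hypothetical placeholders, not a real API.
def run_agent(image, question, model, tools, max_turns: int = 4):
    context = [{"role": "user", "content": question, "image": image}]  # perception
    reply = None
    for _ in range(max_turns):
        reply = model.generate(context)        # planning: reason over image + history
        context.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)          # expects the JSON format taught in SFT
        if call is None:                       # no tool requested -> final answer
            return reply
        result = tools[call["tool"]](**call["arguments"])    # action: run the tool
        context.append({"role": "tool", "content": result})  # new evidence to perceive
    return reply
```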
Self-Powered Multimodal Emotion Recognition System Dec 2023 - Aug 2024
Lead Researcher | Advisor: Prof. Ding Zheng (UESTC)
Project Introduction: Developed a wearable device integrating voice and EEG signal analysis for emotion detection.
Personal Contributions:
- Designed the model architecture: a bimodal binding mechanism processes voice signals, combining modality-invariant and modality-specific features before feeding them into a Transformer; a wavelet-transform-based preprocessing step converts EEG signals into time-frequency image patches for a Vision Transformer (a preprocessing sketch follows this list).
- Fabricated flexible organic photovoltaic devices using a PM6:Y6 = 1:1.2 blend as the organic active layer and ITO-PET as the flexible cathode, achieving a power conversion efficiency exceeding 12.39%.
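A sketch of the wavelet preprocessing, assuming PyWavelets' continuous wavelet transform with a Morlet wavelet; the wavelet choice, scale range, and sampling rate are illustrative assumptions, not necessarily the project's exact settings:

```python
import numpy as np
import pywt  # PyWavelets

def eeg_to_scalogram(signal: np.ndarray, fs: float = 250.0,
                     n_scales: int = 64) -> np.ndarray:
    """Turn one EEG channel into a time-frequency image for a ViT.

    Returns an (n_scales, len(signal)) magnitude scalogram; downstream it
    can be resized and cut into fixed-size patches like any other image.
    The wavelet ('morl'), scales, and fs are illustrative choices.
    """
    scales = np.arange(1, n_scales + 1)
    coeffs, _freqs = pywt.cwt(signal, scales, "morl", sampling_period=1.0 / fs)
    scalogram = np.abs(coeffs)
    # Min-max normalize so patch values behave like image pixels in [0, 1].
    return (scalogram - scalogram.min()) / (scalogram.max() - scalogram.min() + 1e-8)
```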
Achievements:
- First-author paper accepted at ICDT 2025 (Oral Presentation)
- National First Prize in IoT Design Competition 2024
- National Innovation Training Program (Excellent Rating)
Superionic Conductor Materials Screening & Phase Prediction Jan 2025 - May 2025
Lead Researcher | Advisor: Prof. Hong Zhu (SJTU)
Project Introduction: Fine-tune the general-purpose machine-learning potential MatterSim on a small dataset of superionic conductors in the Li-N-S system, then use the fine-tuned model to screen structures produced by the generative model MatterGen for stable Li-N-S materials.
Personal Contributions — Constructed an active learning loop that iteratively updates the model and the data (a sketch of the committee-selection step follows this list):
- Trained and tested MatterSim on the small labelled set of superionic conductor data.
- Generated unlabelled candidate superionic conductor structures with MatterGen.
- Loaded MatterSim as the property predictor and applied the Query by Committee method (an ensemble of independently trained models) to estimate its uncertainty on the generated structures.
- Performed DFT calculations on the highest-uncertainty structures and added them to the training data for the next round of training.
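The selection step, as a sketch: Query by Committee takes the spread of predictions across an ensemble of fine-tuned models as the uncertainty signal. `predict_energy` is a hypothetical method standing in for the actual MatterSim inference API:

```python
import numpy as np

def select_by_committee(structures, committee, budget: int = 50):
    """Rank generated structures by committee disagreement (QBC).

    committee: independently fine-tuned potential models, each assumed to
    expose predict_energy(structure) -> float (a hypothetical API).
    The most-disputed structures go to DFT, and the labelled results are
    appended to the training set for the next fine-tuning round.
    """
    disagreement = np.array([
        np.std([model.predict_energy(s) for model in committee])
        for s in structures
    ])
    order = np.argsort(disagreement)[::-1]  # most uncertain first
    return [structures[i] for i in order[:budget]]
```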
Achievement: Selected for SJTU-Global College Research Internship Program 2025.