MiniCPM-V
Game over for GPT-4? Open-source models close the gap on multimodal understanding, especially in the image-text arena
A “visual language model” is a type of artificial intelligence model that combines natural language processing (NLP) with computer vision to understand and generate descriptions of visual content such as images or videos. Before we dive deep into the latest MiniCPM model, let’s take a look at the evolution of visual language models (VLMs).
VLM Overview
- Integration of Vision and Language: Unlike traditional language models that operate solely on text data, visual language models are designed to comprehend both textual and visual information. By integrating computer vision capabilities with natural language understanding, these models bridge the gap between visual and textual data modalities.
- Multimodal Representation Learning: Visual language models learn rich representations that capture the semantic relationships between words and visual features. Through multimodal representation learning, these models extract meaningful associations between the content of an image or video and its corresponding textual descriptions.
Applications
- Image Captioning: Visual language models excel at generating coherent, descriptive captions for images by analyzing their visual content (see the short sketch after this list).
- Visual Question Answering (VQA): These models can answer questions about visual content, where the questions are posed in natural language. They understand the context of the image and provide relevant textual responses.
- Image Retrieval: Visual language models facilitate image retrieval tasks by understanding the semantics of textual queries and retrieving relevant images based on their content.
- Scene Understanding: By jointly analyzing visual and textual information, these models contribute to scene understanding tasks, such as object recognition, scene classification, and activity recognition.
- Information Extraction: Helps extract information from documents such as research papers, invoices, and signboards, and present it in diverse output formats (text, JSON, tables, XML, HTML) for downstream processing.
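To make the image-captioning use case concrete, here is a minimal sketch using the Hugging Face image-to-text pipeline. The BLIP checkpoint and the image file name are purely illustrative choices, not part of MiniCPM:

```python
# Minimal image-captioning sketch; model choice and file name are illustrative.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
result = captioner("street_scene.jpg")     # any local image path or URL
print(result[0]["generated_text"])         # e.g. "a busy street with cars and people"
```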
Architecture
- Encoder-Decoder Architecture: Visual language models typically adopt an encoder-decoder architecture, where the encoder processes visual input (e.g., images) to extract visual features, while the decoder generates textual descriptions based on these features.
- Attention Mechanisms: Attention allows the model to focus adaptively on relevant regions of the input image while generating each word of the description, improving the quality of the output (a conceptual sketch follows this list).
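To make the encoder-decoder idea concrete, here is a deliberately tiny PyTorch sketch in which text tokens cross-attend to visual features. All names and dimensions are made up for illustration and do not reflect MiniCPM’s actual implementation:

```python
# Conceptual sketch of a vision-language decoder: text tokens attend over image features.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8):
        super().__init__()
        # Stand-in visual encoder: projects pre-extracted patch features to d_model.
        self.visual_encoder = nn.Linear(768, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Cross-attention lets each text position look at different image regions.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_features, token_ids):
        vis = self.visual_encoder(patch_features)      # (B, num_patches, d_model)
        txt = self.token_embed(token_ids)              # (B, seq_len, d_model)
        fused, attn_weights = self.cross_attn(query=txt, key=vis, value=vis)
        return self.lm_head(fused), attn_weights       # next-token logits per position

model = TinyVLM()
logits, attn = model(torch.randn(1, 196, 768), torch.randint(0, 32000, (1, 16)))
```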
Training Data and Pre-training
Visual language models require large-scale datasets that contain paired examples of images/videos and their corresponding textual descriptions. These datasets are used for pre-training the model on tasks such as image captioning or visual question answering. Pre-training on these tasks helps the model learn meaningful representations of both visual and textual modalities.
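Conceptually, such a paired dataset is simple to represent. The sketch below shows one hypothetical way to wrap (image, caption) pairs for training; the class name, preprocessing callable, and tokenizer are assumptions rather than any specific codebase:

```python
# Hypothetical paired image-caption dataset; names and callables are illustrative.
from PIL import Image
from torch.utils.data import Dataset

class ImageCaptionDataset(Dataset):
    def __init__(self, pairs, preprocess, tokenizer):
        self.pairs = pairs            # list of (image_path, caption) tuples
        self.preprocess = preprocess  # image transform -> tensor
        self.tokenizer = tokenizer    # caption -> token ids

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        image_path, caption = self.pairs[idx]
        pixels = self.preprocess(Image.open(image_path).convert("RGB"))
        tokens = self.tokenizer(caption)
        return pixels, tokens
```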
Early VLM models
Several influential visual language models emerged in the early days of the field, including but not limited to:
- ViT (Vision Transformer): A transformer-based image encoder originally designed for image classification; it became the standard visual backbone for many vision-language models.
- CLIP (Contrastive Language-Image Pre-training): Developed by OpenAI, CLIP learns a joint representation of images and text by maximizing agreement between the two modalities across a large dataset (a usage sketch follows this list).
- UNITER (UNiversal Image-TExt Representation): This model jointly learns image and text representations by aligning them in a shared semantic space, enabling various vision-language tasks.
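As a quick illustration of how CLIP’s joint image-text space is used in practice, here is a minimal zero-shot matching sketch with the Hugging Face Transformers CLIP classes; the checkpoint name and image file are illustrative:

```python
# Zero-shot image-text matching with CLIP; checkpoint and image are illustrative.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")
inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(probs)   # similarity of the image to each candidate caption
```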
Although many VLMs have appeared, open-source models have not matched commercial offerings such as GPT-4, GPT-4o, Claude, and Gemini. That changed when OpenBMB launched the MiniCPM-Llama3-V 2.5 model based on Llama 3. Now let me take you through the evolution of the MiniCPM models.
MiniCPM
MiniCPM started primarily as a language model series focused on deployment on edge devices.
- Model Series: MiniCPM is a series of end-side large language models (LLMs) developed collaboratively by ModelBest Inc. and TsinghuaNLP.
- Main Model: The primary language model in the series is MiniCPM-2B, which contains 2.4 billion non-embedding parameters (totaling 2.7 billion parameters).
Performance Comparison:
- After supervised fine-tuning (SFT), MiniCPM-2B performs similarly to Mistral-7B on comprehensive open-source benchmarks, with stronger abilities in Chinese, mathematics, and coding.
- It outperforms models like Llama2-13B, MPT-30B, and Falcon-40B.
- After direct preference optimization (DPO), MiniCPM-2B surpasses models like Llama2-70B-Chat, Vicuna-33B, and Mistral-7B-Instruct-v0.1 on the MT-Bench evaluation dataset.
Deployment and Efficiency
- MiniCPM can be deployed and used for inference on smartphones.
- Its streaming output speed exceeds human verbal speed.
- Efficient fine-tuning can be done with a single 1080/2080 GPU, while full-parameter fine-tuning is possible with a 3090/4090 GPU.
Model Variants
MiniCPM offers several variants, including:
- MiniCPM-2B-SFT/DPO: Fine-tuned versions aligned with human preferences.
- MiniCPM-V 2.0: A multimodal model achieving top performance on various benchmarks.
- MiniCPM-2B-SFT/DPO-Int4: Int4 quantized version.
- MiniCPM-2B-128k: A long-context version.
- MiniCPM-MoE-8x2B: A model with mixture-of-experts architecture.
- MiniCPM-1B-SFT: A lighter-weight model.
- Mobile Deployment: MiniCPM models can run on smartphones for both text and multimodal inference.
Limitations
- MiniCPM’s smaller scale may lead to occasional hallucinations.
- The model’s output can be influenced significantly by prompt words.
- Knowledge recall accuracy may be limited due to model capacity.
MiniCPM-V 2.0: A Multimodal Language Model
Overview:
- Model Series: MiniCPM-V 2.0 is the second addition to the MiniCPM series, focusing on multimodal understanding.
- Architecture: Built on MiniCPM 2.4B and SigLip-400M, it boasts a total of 2.8 billion parameters.
- Performance Highlights: MiniCPM-V 2.0 excels in OCR and multimodal understanding, rivaling proprietary models like Gemini Pro.
Key Features
Strong OCR and Understanding Capabilities:
- Outperforms significantly larger multimodal language models (MLLMs) (e.g., 17–34B models) across multiple benchmarks.
- Achieves state-of-the-art performance on OCRBench among open-source models.
- Matches Gemini Pro in scene-text understanding.
Trustworthy Behavior:
- First end-side MLLM aligned via multimodal RLHF (using recent RLHF-V techniques).
- Comparable to GPT-4V in preventing hallucinations.
High-Resolution Images at Any Aspect Ratio:
- Accepts 1.8 million-pixel images (e.g., 1344x1344) regardless of aspect ratio.
- Enabled by a technique from LLaVA-UHD.
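To give a rough sense of how arbitrary-resolution input can work, the sketch below slices a large image into ViT-sized tiles. This is only a simplified illustration of the general idea, not the actual LLaVA-UHD partitioning algorithm, and the tile size is an assumption:

```python
# Rough illustration of tiling a high-resolution image; NOT the exact LLaVA-UHD algorithm.
from PIL import Image

def slice_image(img: Image.Image, tile: int = 448):
    cols = max(1, round(img.width / tile))
    rows = max(1, round(img.height / tile))
    resized = img.resize((cols * tile, rows * tile))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
            tiles.append(resized.crop(box))
    return tiles  # each tile is encoded separately; a downscaled overview is typically also kept

tiles = slice_image(Image.new("RGB", (1344, 1344)))   # -> 9 tiles of 448x448
```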
High Efficiency:
- Efficiently deployable on most GPU cards, personal computers, and even end devices like mobile phones.
Bilingual Support:
- Strong bilingual multimodal capabilities in both English and Chinese.
- Enabled by generalizing multimodal capabilities across languages.
MiniCPM-V 2.0 represents a powerful advancement in multimodal language understanding, making it a valuable resource for various applications across languages and domains. In May 2024, the same team released a newer version that combines Meta’s latest model (Llama 3) with SigLip to make GPT-4 sweat.
MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone
- Model Series: MiniCPM-Llama3-V 2.5 is the latest addition to the MiniCPM-V series.
- Architecture: Built on SigLip-400M and Llama3-8B-Instruct, it boasts a total of 8 billion parameters.
- Performance Boost: MiniCPM-Llama3-V 2.5 significantly outperforms its predecessor, MiniCPM-V 2.0.
Key Features
Leading Performance
- Achieved an impressive average score of 65.1 on OpenCompass, evaluated across 11 popular benchmarks.
- Surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3, and Qwen-VL-Max.
- Outperforms other Llama 3-based multimodal language models (MLLMs).
Strong OCR Capabilities
- Processes images with any aspect ratio, handling up to 1.8 million pixels (e.g., 1344x1344).
- Scores 700+ on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max, and Gemini Pro.
- Enhanced full-text OCR extraction and table-to-markdown conversion based on user feedback.
Trustworthy Behavior
- Utilizes the latest RLAIF-V method (from the RLHF-V [CVPR’24] series) for improved reliability.
- Achieves a lower hallucination rate (10.3%) on Object HalBench compared to GPT-4V-1106 (13.6%).
Multilingual Support
- Leverages Llama 3’s strong multilingual capabilities.
- Extends bilingual (Chinese-English) multimodal capabilities to over 30 languages, including German, French, Spanish, Italian, and Portuguese.
Efficient Deployment
- Systematically employs model quantization, CPU optimizations, and NPU optimizations.
- Enables efficient deployment on various platforms.
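For example, the int4-quantized checkpoint released alongside the model can be loaded on a single consumer GPU with the usual Hugging Face pattern; the repository id below should be verified against the official model card:

```python
# Loading an int4-quantized MiniCPM checkpoint; verify the repo id on the official model card.
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-Llama3-V-2_5-int4"   # assumed repo id for the int4 build
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)  # 4-bit weights keep GPU memory low
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model.eval()
```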
MiniCPM-Llama3-V 2.5 represents a powerful advancement in multimodal language understanding, catering to diverse applications across languages and domains. Its combination of performance, reliability, and multilingual support makes it a valuable addition to the open-source community.
Usage
MiniCPM models support the following inference options (a Hugging Face Transformers example follows the list):
- Ollama
- llama.cpp
- vLLM
- HuggingFace
- WebUI (Hugging Face Spaces and ModelScope)
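Here is a minimal inference sketch that follows the pattern on the MiniCPM-Llama3-V 2.5 Hugging Face model card, using a table-to-markdown prompt like the OCR feature described earlier. The image file and question are placeholders, and the exact chat arguments may vary between releases, so treat it as a starting point rather than a definitive recipe:

```python
# Minimal MiniCPM-Llama3-V 2.5 inference sketch (Hugging Face Transformers pattern).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-Llama3-V-2_5"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                  torch_dtype=torch.float16).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("invoice.jpg").convert("RGB")   # placeholder image
msgs = [{"role": "user", "content": "Extract the table in this image as markdown."}]

# model.chat is the custom method exposed by the model's remote code; arguments may differ by release.
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer,
                    sampling=True, temperature=0.7)
print(answer)
```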
MiniCPM-V represents a significant advancement in the field of multimodal learning, providing tools for efficient and effective AI-powered visual language understanding.
Conclusion
In summary, the MiniCPM model stands as a testament to the potential of integrated multimodal AI systems. Its innovative approach and robust performance across various tasks highlight its significance and pave the way for future developments in the field, ultimately contributing to more intelligent and versatile AI solutions. Moreover, its performance rivals offerings from tech giants such as OpenAI, Microsoft, and Google, setting a new benchmark for open-source models.