Phi-3-Vision: Uniting Text and Image Understanding in Microsoft's Latest AI

Microsoft has released Phi-3-Vision, a new multimodal AI model.
With it, Microsoft expands its compact Phi-3 language model series in one stroke. The new family member not only analyzes text but can also interpret and understand images, making it a multimodal powerhouse.

With 4.2 billion parameters, optimization for mobile use, and general visual reasoning capabilities, Phi-3-Vision lets users ask high-level questions about images or charts and receive insightful answers. Unlike image-generation models such as DALL·E or Stable Diffusion, Phi-3-Vision focuses on image analysis and understanding: it extracts knowledge from visual data rather than creating new images.
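To illustrate what such a query could look like in practice, here is a minimal sketch using the Hugging Face transformers library. The model id microsoft/Phi-3-vision-128k-instruct, the <|image_1|> prompt placeholder, and the image URL are assumptions based on typical published usage, not details from this article.

```python
# Minimal sketch: asking Phi-3-Vision a question about a chart.
# Assumption: the model id and prompt format below follow the publicly
# documented Hugging Face usage pattern; they are not from this article.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"  # assumed model id

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Any chart or photo works; this URL is only a placeholder.
image = Image.open(
    requests.get("https://example.com/chart.png", stream=True).raw
)

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat trend does this chart show?"}
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=200)
# Strip the prompt tokens and decode only the model's answer.
answer = processor.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```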

Phi-3-Vision follows Phi-3-Mini, which has 3.8 billion parameters. The family now has four members: Phi-3-Mini, Phi-3-Vision, Phi-3-Small (7 billion parameters), and Phi-3-Medium (14 billion parameters).

This reflects one of the dominant trends in AI development: a push toward models that consume ever less processing power and memory. The best models in this respect are optimized for mobile devices and other resource-constrained platforms. Microsoft has already demonstrated a successful implementation of this approach with Orca-Math, which beat larger models at arithmetic. Phi-3-Vision is currently available in preview, while Phi-3-Mini, Phi-3-Small, and Phi-3-Medium are published in the Azure model library.
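As a rough illustration of why parameter count matters on such platforms, the back-of-the-envelope estimate below shows the approximate memory needed just to hold the weights of a 4.2-billion-parameter model at common precisions. The figures are simple arithmetic for illustration, not benchmarks from Microsoft.

```python
# Back-of-the-envelope weight-memory estimate for a 4.2B-parameter model.
# Rough illustrative figures only, not measured numbers.
params = 4.2e9

for label, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{label:>9}: ~{gib:.1f} GiB")

# Prints roughly 15.6 GiB (fp32), 7.8 GiB (fp16), and 2.0 GiB (int4),
# which is why smaller models are attractive for phones and edge devices.
```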