Introducing Enhanced Image-to-Text Capabilities in Spaces
We are excited to announce the latest enhancements to our Spaces product, bringing powerful new image understanding capabilities to your workflows.
What This Means for You
With these updates, you can now upload any picture, have it understood by our models, and use that understanding as input for further actions. For instance, you can analyze a photo from a conference, extract company names and websites, and enrich the data with additional information—all within seconds.
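As a rough sketch of what this flow could look like in code (the endpoint URL, request fields, and API key variable below are hypothetical placeholders, not our published API; see the documentation for the real interface):

```python
import base64
import requests

SPACES_API_KEY = "your-api-key"  # hypothetical placeholder

# Encode the conference photo for transport.
with open("conference_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Hypothetical chat endpoint; consult our documentation for the real one.
response = requests.post(
    "https://api.example.com/v1/chat",
    headers={"Authorization": f"Bearer {SPACES_API_KEY}"},
    json={
        "model": "mistral/pixtral-12b",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List every company name and website visible in this photo."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    },
)
print(response.json())
```

The extracted company names can then be fed into a follow-up request that enriches each entry with additional information.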
But that's not all...
Integration of Pixtral-12B by Mistral AI
We have integrated Pixtral-12B, the first-ever multimodal model from Mistral AI, into our platform. Pixtral-12B is a breakthrough in multimodal reasoning, trained with interleaved image and text data. It excels in tasks such as chart and figure understanding, document question answering, and instruction following without compromising on text-only benchmarks.
Key features of Pixtral-12B include:
- Natively multimodal: Processes images at their native resolution and aspect ratio.
- High performance: Achieves 52.5% on the MMMU reasoning benchmark, surpassing larger models.
- Flexible architecture: Supports variable image sizes and aspect ratios, and handles multiple images within a context window of 128,000 tokens.
For more details on Pixtral-12B, visit the official announcement.
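If you want to experiment with Pixtral-12B directly, here is a minimal sketch using Mistral's own Python SDK (assuming the mistralai v1 package and a Mistral API key; within Spaces you simply select the model instead):

```python
from mistralai import Mistral  # assumes the mistralai v1 SDK is installed

client = Mistral(api_key="your-mistral-api-key")

# Ask Pixtral-12B about a chart; image_url may be a hosted URL
# or a base64 data URI.
response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the trend shown in this chart."},
            {"type": "image_url", "image_url": "https://example.com/chart.png"},
        ],
    }],
)
print(response.choices[0].message.content)
```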
But wait, why settle for a single model when you can have ALL the leading models...
Expanded Multi-Vision Model Support
In addition to Pixtral-12B, we now support a range of multi-vision models from top AI developers, including:
- Anthropic Models: anthropic/claude-3-5-sonnet, anthropic/claude-3-5-sonnet-20240620, anthropic/claude-3-haiku, anthropic/claude-3-haiku-20240307, anthropic/claude-3-opus, anthropic/claude-3-opus-20240229, anthropic/claude-3-sonnet, anthropic/claude-3-sonnet-20240229.
- OpenAI Models: openai/gpt-4o, openai/gpt-4o-2024-05-13, openai/gpt-4o-2024-08-06, openai/gpt-4o-mini, openai/gpt-4o-mini-2024-07-18, openai/gpt-4-turbo, openai/gpt-4-turbo-2024-04-09.
- Google Models: google/gemini-1.5-pro, google/gemini-1.5-pro-001, google/gemini-1.5-pro-experimental, google/gemini-1.5-flash, google/gemini-1.5-flash-001, google/gemini-1.5-flash-experimental.
These models enhance our platform's ability to handle complex multimodal tasks, including text and image processing.
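The identifiers above follow the provider/model naming convention also used by open-source routers such as LiteLLM. As an illustrative sketch (not a statement about how our platform is built internally), switching between these vision models can be as simple as changing one string:

```python
# pip install litellm; requires OPENAI_API_KEY / ANTHROPIC_API_KEY env vars
from litellm import completion


def describe(image_url: str, model: str) -> str:
    """Send the same vision prompt to any supported provider."""
    response = completion(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content


# Swapping providers is just a different model string.
print(describe("https://example.com/photo.jpg", "openai/gpt-4o"))
print(describe("https://example.com/photo.jpg", "anthropic/claude-3-5-sonnet-20240620"))
```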
New Use Cases: Image-to-Text and Beyond
Our expanded capabilities now include robust image-to-text functionalities, enabling new use cases such as:
- OCR and Text Extraction: Upload an image and extract all of its text, which our agents can then use to search the web and provide comprehensive answers (see the sketch after this list).
- Dynamic Image Understanding: Our multi-vision models can interpret uploaded images in real-time, making it easy to integrate visual data into your workflows.
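Here is a hedged sketch of the OCR flow, again in the OpenAI-compatible message format (the model choice and helper name are illustrative, not prescriptive):

```python
from litellm import completion  # requires OPENAI_API_KEY env var


def extract_text(image_url: str) -> str:
    """OCR: ask a vision model to transcribe all text in an image."""
    response = completion(
        model="openai/gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract all text visible in this image, verbatim."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content


extracted = extract_text("https://example.com/receipt.png")
# The extracted text can then be handed to an agent for web search.
print(extracted)
```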
This functionality complements our existing image generation tools: you can now not only create images but also interpret them. We covered image generation in last week's blog post here.
One Last Thing...
To make our RAG-as-a-Service functionality available to everyone, and to spare you from building complicated RAG pipelines yourself, the free tier now includes one datasource. Previously, the free tier allowed no datasources at all; now free-tier users can upload one document or webpage and see the benefits of our platform firsthand.
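A minimal sketch of what uploading that first datasource could look like; the endpoint and field names below are hypothetical placeholders, so check the documentation for the actual API:

```python
import requests

SPACES_API_KEY = "your-api-key"  # hypothetical placeholder

# Hypothetical datasource endpoint; the real route is in our docs.
with open("product_faq.pdf", "rb") as f:
    response = requests.post(
        "https://api.example.com/v1/datasources",
        headers={"Authorization": f"Bearer {SPACES_API_KEY}"},
        files={"file": ("product_faq.pdf", f, "application/pdf")},
    )
print(response.json())  # e.g. the id of the newly indexed datasource
```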
We are committed to continuously improving our platform to provide you with the best possible tools for your needs. Stay tuned for more updates and innovations.
For more information on how to leverage these new features, please refer to our documentation, join our Slack community, or contact our support team.