We’re Moving Beyond Text — and It’s a Big Deal
For a while now, large language models have done exactly what it says on the tin: they work with language. You give them text, they give you text back. Everything meaningful happens inside that little bubble.
That era? It’s ending.
We’re watching something shift right now. Language models are turning into something bigger: systems that actually understand and generate across text, images, audio, video, and messy real-world data. And not as separate tools glued together, but as one unified system.
This isn’t a small step. It changes what AI even is.
From Words to the Real World
A text-only model lives in abstraction. It reads descriptions of things. It’s one step removed.
A multimodal model starts to understand the things themselves.
Here’s what that means in practice:
Instead of typing:
“Describe what a login screen looks like”
You can now:
- Take a screenshot and ask, “How would you improve this?”
- Sketch something on a napkin and turn it into working code
- Record a quick voice memo and have it become structured data
- Throw in logs, charts, and notes and ask for one coherent answer
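To make the screenshot example concrete, here’s a minimal sketch of what that interaction can look like from code. I’m using the ollama Python client as one plausible way to reach a local multimodal model; the model name and file path are placeholders, not a prescribed setup.

```python
# Minimal sketch: ask a local multimodal model about a screenshot.
# Assumes the `ollama` Python package is installed and a vision-capable
# model has already been pulled locally ("gemma3" is a placeholder).
import ollama

response = ollama.chat(
    model="gemma3",  # swap in whichever multimodal model you run locally
    messages=[{
        "role": "user",
        "content": "How would you improve this login screen?",
        "images": ["login_screenshot.png"],  # hypothetical path
    }],
)
print(response["message"]["content"])
```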
The model isn’t just finishing your sentences anymore. It’s interpreting context across different mediums.
That’s a subtle shift, but a huge one. We’re moving from predicting text to approximating a real understanding of the world.
Why This Matters for Local Models
Here’s the surprising part: this isn’t just happening in massive cloud systems.
Local models—like Gemma and others you can run on your own machine—are starting to get multimodal too. They might be smaller or more specialized, but the direction is clear. Multimodality is becoming table stakes.
And that changes everything.
Running a multimodal model locally means:
- Your images, documents, and recordings stay on your device
- Responses come back faster, since nothing makes a round trip to the cloud
- Your tools work offline or in sensitive environments
- You can deeply integrate AI into your personal workflows
This is especially exciting for phones.
Think about what your phone already has:
- A great camera
- Microphones
- All kinds of sensors
- Surprisingly powerful compute
Now pair that with a local multimodal model. Your phone stops being just a screen you stare at. It becomes something that actually understands your environment.
Picture this:
- Snap a photo of a whiteboard → instantly turns into structured project tasks
- Record a meeting → get a fully searchable knowledge base, no upload required
- Point your camera at a chart → ask questions about the data right there
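Here’s a hedged sketch of the whiteboard idea, again using the ollama client as a stand-in for whatever local runtime you prefer. The JSON shape, model name, and file path are assumptions for illustration:

```python
# Sketch: turn a whiteboard photo into structured tasks, fully on-device.
# Model name, file path, and schema are illustrative assumptions.
import json
import ollama

response = ollama.chat(
    model="gemma3",
    messages=[{
        "role": "user",
        "content": (
            "Extract every action item from this whiteboard photo. "
            'Return JSON shaped like {"tasks": [{"title": ..., "owner": ...}]}.'
        ),
        "images": ["whiteboard.jpg"],
    }],
    format="json",  # ask the runtime to constrain output to valid JSON
)

tasks = json.loads(response["message"]["content"]).get("tasks", [])
for task in tasks:
    print(f"- {task['title']} (owner: {task.get('owner', 'unassigned')})")
```

In practice you’d still validate the result against your schema, since smaller local models won’t always follow it exactly.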
No cloud. No privacy worries. No waiting.
A Whole New Kind of Tool
As multimodal models become more accessible, the way we build software changes.
Instead of rigid interfaces and endless menus, tools become fluid. They just… interpret what you give them.
You no longer need:
- Special import pipelines for every file type
- Strict data schemas defined upfront
- Separate apps for text, images, and data
The model becomes the interface.
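As a toy illustration of that idea: one function, any file, no per-format pipeline. Everything here (the model name, the routing logic) is a sketch under those assumptions, not a real product API.

```python
# Toy sketch of "the model becomes the interface": hand the model whatever
# file you have, with no per-format import pipeline. All names are
# illustrative; assumes a local multimodal model behind the ollama client.
from pathlib import Path
import ollama

IMAGE_TYPES = {".png", ".jpg", ".jpeg", ".webp"}

def ask_about(path: str, question: str) -> str:
    file = Path(path)
    message = {"role": "user", "content": question}
    if file.suffix.lower() in IMAGE_TYPES:
        message["images"] = [str(file)]  # images go in as pixels
    else:
        # Anything text-like (logs, CSV, code, notes) goes in verbatim;
        # the model, not an upfront schema, decides how to interpret it.
        message["content"] += "\n\n" + file.read_text(errors="replace")
    response = ollama.chat(model="gemma3", messages=[message])
    return response["message"]["content"]

# Same call, very different inputs:
# ask_about("error.log", "What is failing here?")
# ask_about("chart.png", "Summarize the trend in this chart.")
```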
That opens the door to some really cool tools:
- A workspace where any input—anything—can be analyzed and transformed
- A creative space where sketches, text prompts, and references blend together
- Debugging tools that understand screenshots, error logs, and code all at once
- Knowledge systems that ingest “whatever you throw at them” and stay searchable
The hard part shifts from you figuring out the tool to the model figuring it out for you.
For Developers, This Is Huge
If you write code, this change hits even closer to home.
AI-assisted programming used to mean:
“Here’s a function. Finish it for me.”
Now it’s becoming:
“Here’s my screen, my logs, my data, and what I’m trying to do. Help me fix this.”
Multimodal AI can:
- Read your code and see what the UI actually looks like
- Connect runtime errors with visual glitches
- Understand architecture diagrams alongside the implementation
- Generate working code from a rough sketch or screenshot
This closes the gap between having an idea, building it, and fixing it.
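A hedged sketch of what that richer prompt can look like in practice. The paths, model name, and scenario are placeholders; the point is that code, logs, and a screenshot travel together in one request:

```python
# Sketch: one debugging prompt combining code, a runtime log, and a
# screenshot of the broken UI. All paths and names are hypothetical.
from pathlib import Path
import ollama

code = Path("src/login_form.py").read_text()
log = Path("logs/runtime_error.log").read_text()

response = ollama.chat(
    model="gemma3",
    messages=[{
        "role": "user",
        "content": (
            "The submit button renders off-screen (see screenshot). "
            "Here are the relevant code and the runtime log. "
            "What is the likely cause, and what should I change?\n\n"
            f"--- CODE ---\n{code}\n\n--- LOG ---\n{log}"
        ),
        "images": ["ui_glitch.png"],
    }],
)
print(response["message"]["content"])
```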
And when it all runs locally? It becomes a native part of your development environment. No context switching. No sending your code to some server. No dependencies on external APIs that might change or disappear.
Moving Beyond Screens and Menus
Here’s what excites me most: we’re starting to move beyond traditional interfaces altogether.
Instead of clicking through menus and filling out forms, you just… interact. Like you would with a collaborator.
You show, tell, sketch, or describe. The system interprets and responds however makes sense—visually, in text, with audio, whatever works. Iteration becomes conversational and visual at the same time.
In that sense, multimodal models don’t just improve software. They make parts of it disappear.
The UI becomes optional.
A Quiet but Radical Shift
This is happening fast, but it’s easy to underestimate how big it is.
Multimodality means:
- AI understands more of the real world
- Tools become more personal and adaptable
- Local devices gain capabilities that used to require the cloud
- Programming shifts from memorizing syntax to just describing what you want
And most importantly: it lowers the friction between thinking something and actually making it real.
We’re not just getting better models.
We’re getting closer to systems that can work with us in the same messy, beautiful mediums we actually use to think.
Computer…? What shall we do next?
WEBSTERIX