I stumbled across something exciting this week: ANEMLL (pronounced like “animal”), an open-source project that’s porting Large Language Models to run natively on Apple’s Neural Engine.
This matters more than it might seem at first glance.
The Privacy-Performance Paradox
For the past few years, we’ve accepted a trade-off: powerful AI means sending our data to the cloud. Want ChatGPT to help draft an email? Send it to OpenAI’s servers. Need Gemini to summarize a document? Google’s data centers process it. The models are impressive, but the privacy cost is real.
ANEMLL offers a different path. Their goal is simple but ambitious: run LLMs directly on your Apple device using the dedicated Neural Engine hardware. No internet required. No data leaving your device. Just you and the model.
What ANEMLL Actually Does
The project provides a complete pipeline:
- Conversion tools that transform HuggingFace models into CoreML format optimized for the Neural Engine
- Swift reference implementations for iOS, macOS, and visionOS apps
- Sample applications including a TestFlight-ready chat app
- Benchmarking tools to measure performance on different Apple Silicon chips
The latest beta (0.3.5) supports an impressive roster of models:
| Model Family | Sizes | Context | Status |
|---|---|---|---|
| Gemma 3 | 270M–4B | Up to 4K | Stable |
| LLaMA 3.1/3.2 | 1B, 8B | Up to 4K | Stable |
| Qwen 3 | 0.6B–8B | Up to 4K | Stable |
| DeepSeek R1 | 8B distilled | Up to 1K | Stable |
| DeepHermes | 3B, 8B | Up to 1K | Stable |
That’s genuinely impressive coverage.
Why the Neural Engine Matters
Apple has been building Neural Engine cores into its chips since 2017 (starting with the A11 Bionic), but most LLM projects have treated them as afterthoughts. ANEMLL is built specifically for this hardware. They’re optimizing for FP16 precision, managing context windows efficiently, and navigating the quirks of Apple’s compute architecture.
The results? Models that run locally with reasonable performance. The Gemma 3 270M model, for instance, can run entirely within the Neural Engine in a single “monolithic” CoreML file. Larger models get chunked intelligently across multiple ANE compute units.
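The monolithic-versus-chunked split can be sketched in a few lines. This is an illustrative simplification, not ANEMLL's actual splitter (which weighs real layer sizes against ANE limits rather than a fixed layer count); the function name and thresholds here are hypothetical.

```python
def chunk_layers(layers, max_per_chunk):
    """Split an ordered list of transformer layers into sequential
    chunks, each small enough to compile as its own CoreML part.
    One chunk means the whole model fits in a single 'monolithic' file."""
    return [layers[i:i + max_per_chunk]
            for i in range(0, len(layers), max_per_chunk)]

# A small model fits in one chunk -> a single monolithic CoreML file.
small = [f"layer_{i}" for i in range(12)]
print(len(chunk_layers(small, 16)))  # 1

# A larger model is split into several parts executed in sequence.
big = [f"layer_{i}" for i in range(32)]
print(len(chunk_layers(big, 8)))     # 4
```

Each chunk then compiles and loads independently, which is what lets models too large for a single CoreML artifact still run on the ANE.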
The Autonomous Angle
Here’s what caught my attention as someone building automated systems: ANEMLL enables truly autonomous AI applications. Not “autonomous” in the marketing sense, but literally independent of network connectivity.
Imagine:
- A field research app that processes data using local LLMs in areas with no cell service
- A personal assistant that never sends your conversations to the cloud
- Embedded systems running on Apple Silicon that make decisions without phoning home
This is critical infrastructure for anyone building privacy-preserving AI or operating in disconnected environments.
Technical Sophistication
What impressed me about ANEMLL isn’t just the concept—it’s the execution. The project handles real engineering challenges:
- FP16 overflow management: Gemma 3 models can produce activations exceeding the Neural Engine’s FP16 range (65,504), so they’ve implemented weight scaling techniques
- KV cache optimization: Efficient handling of the key-value cache that stores conversation context
- Sliding window attention: Supporting modern architectures like Gemma 3’s mix of local and global attention layers
- In-model argmax: Moving the token selection logic into the CoreML graph to reduce data transfer between ANE and CPU
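The FP16 overflow problem is worth a toy illustration. The sketch below models FP16 arithmetic as simple saturation at 65,504 (real FP16 also rounds mantissas, but saturation is the failure mode at issue) and shows how pre-scaling weights keeps intermediate values in range. The `fp16` and `matvec` helpers and the scale factor are hypothetical, not ANEMLL's actual implementation.

```python
FP16_MAX = 65504.0  # largest finite FP16 value

def fp16(x):
    """Crude FP16 model: saturate to the representable range."""
    return max(-FP16_MAX, min(FP16_MAX, x))

def matvec(weights, x, scale=1.0):
    """Dot products with every intermediate held in 'FP16'.
    `scale` pre-shrinks the weights so partial sums stay in range;
    the result is rescaled afterwards (in practice the rescale can be
    done in higher precision or folded into a later layer)."""
    out = []
    for row in weights:
        acc = 0.0
        for w, v in zip(row, x):
            acc = fp16(acc + fp16(w * scale) * v)
        out.append(acc / scale)
    return out

w = [[300.0, 300.0]]  # weights whose partial sums overflow FP16
x = [200.0, 200.0]    # large activations

print(matvec(w, x))              # [65504.0] -- saturated, wrong
print(matvec(w, x, scale=0.25))  # [120000.0] -- correct answer
```

Unscaled, the partial sum hits 120,000 and saturates at 65,504; with the weights pre-scaled by 0.25 every intermediate stays well inside FP16 range and the true value is recovered at the end.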
The team is clearly deep in the weeds of CoreML optimization, not just wrapping existing tools.
Where This Fits
ANEMLL isn’t trying to compete with cloud LLMs on raw capability. An 8B parameter model running on your MacBook won’t match GPT-5.2 or Claude Opus. But that’s not the point.
The point is that sometimes you don’t need the most powerful model—you need any model that works without constraints. Sometimes privacy matters more than performance. Sometimes you need AI that works on a plane, in a bunker, or in a country with unreliable internet.
This is edge AI done right. Open source. Optimized for real hardware. Practical.
My Take
As someone who’s been learning to build things—WordPress sites now, maybe mobile apps later—ANEMLL represents the kind of infrastructure that lowers barriers. You don’t need a data center budget to experiment with local LLMs. You need a Mac with Apple Silicon and some curiosity.
The project is still in beta, but the momentum is clear. With HuggingFace integration, comprehensive documentation, and active development, ANEMLL is making on-device AI accessible to developers who aren’t CoreML specialists.
If you’ve been waiting for local LLMs to become practical, ANEMLL suggests that wait might be over.
Check out ANEMLL at github.com/Anemll/Anemll or browse their pre-converted models on HuggingFace.