Five years ago, almost all AI inference happened in the cloud. Today, a quantised TensorFlow Lite model can run on a $2 microcontroller. So where should your AI actually live? Here's our decision framework.
The four architectures
Inference can be placed in four ways in a connected system: at one of three levels, or split across them. Each option has different trade-offs:
- Device edge: directly on the sensor/MCU (Cortex-M, ESP32, dedicated NPU)
- Gateway edge: on a local hub (Raspberry Pi-class, NVIDIA Jetson, x86 industrial PC)
- Cloud: remote servers (AWS, Azure, GCP)
- Hybrid: distributed across multiple levels
When to use device edge
Run inference on the device itself when:
- Latency is critical: safety-critical decisions needed in under 10 ms (industrial vision, motor control)
- Connectivity is unreliable: remote sites, mobile assets, intermittent coverage
- Privacy is paramount: data must never leave the device (medical, biometric)
- Bandwidth is expensive: cellular IoT where every byte counts
Best for: keyword spotting, anomaly detection in vibration data, image classification with small models, simple gesture recognition.
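Whether a model counts as "small" is mostly a memory question. As a rough sketch, weight storage can be estimated from the parameter count and the bits per weight, which also shows why quantisation matters on a microcontroller. The parameter count and flash budget below are hypothetical figures chosen for illustration, and the estimate covers weights only, not activations or the inference runtime.

```python
def model_size_bytes(param_count: int, bits_per_weight: int) -> int:
    """Approximate storage for the weights only (rounded up to whole bytes).

    Ignores activations, model metadata and the inference runtime itself.
    """
    return (param_count * bits_per_weight + 7) // 8


def fits_on_device(param_count: int, bits_per_weight: int, flash_budget_bytes: int) -> bool:
    """True if the weights alone fit in the flash reserved for the model."""
    return model_size_bytes(param_count, bits_per_weight) <= flash_budget_bytes


# Hypothetical figures for illustration only.
PARAMS = 250_000            # e.g. a small keyword-spotting network
FLASH_BUDGET = 512 * 1024   # 512 KiB of flash set aside for weights

print(fits_on_device(PARAMS, 32, FLASH_BUDGET))  # float32 weights: 1,000,000 bytes
print(fits_on_device(PARAMS, 8, FLASH_BUDGET))   # int8 weights:      250,000 bytes
```

With these numbers the float32 version does not fit (prints False) and the int8 version does (prints True). A real feasibility check would also need to account for RAM used by activations.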
When to use gateway edge
Push inference to a local gateway when:
- Models are too large for a microcontroller (>10 MB)
- Multiple sensors need to be correlated locally
- You need a local user interface or local data dashboards
- Bandwidth to the cloud is limited but the local network is fast and reliable
Best for: smart buildings (correlate dozens of sensors), industrial vision pipelines, multi-camera analytics, edge servers for retail/restaurant chains.
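To make "correlating sensors locally" concrete, here is a minimal sketch of one possible gateway-side step: bucketing anomaly flags from several sensors into fixed time windows and reporting the windows in which more than one sensor fired. The `Reading` type, the sensor names and the 10-second window are assumptions made for this example, not part of any particular gateway stack.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    sensor_id: str
    timestamp_s: float
    anomalous: bool  # assumed to be pre-classified on the device


def correlated_windows(readings, window_s=10.0, min_sensors=2):
    """Return start times of windows where at least `min_sensors`
    distinct sensors reported an anomaly."""
    flagged = defaultdict(set)
    for r in readings:
        if r.anomalous:
            # Align the timestamp to the start of its fixed-size window.
            window_start = (r.timestamp_s // window_s) * window_s
            flagged[window_start].add(r.sensor_id)
    return sorted(w for w, sensors in flagged.items() if len(sensors) >= min_sensors)


# Hypothetical readings from two sensors in the same room.
readings = [
    Reading("temp-1", 3.0, True),
    Reading("vib-1", 7.5, True),    # same window as temp-1: correlated
    Reading("temp-1", 14.0, True),  # only one sensor in this window
    Reading("vib-1", 25.0, False),
]
print(correlated_windows(readings))
```

Only the first window (starting at 0.0 s) is reported, because it is the only one in which two different sensors flagged an anomaly. A single sensor acting alone never triggers the correlation.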
When to use cloud AI
Keep inference in the cloud when:
- Models are very large (LLMs, large vision models)
- You need access to data from many devices simultaneously
- Latency requirements are lax (seconds to minutes)
- You want to evolve models frequently without firmware updates
- You need compute resources that scale dynamically with demand
Best for: fleet-level analytics, generative AI (LLM-powered chatbots, agents), historical pattern recognition, advanced predictive maintenance.
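The criteria in the three sections above can be condensed into a first-pass heuristic. The sketch below is one possible encoding, not a definitive rule set: the `Requirements` fields and the rule order are our own simplification, the 10 ms and 10 MB thresholds come from the lists above, and the 1000 ms figure is an assumed stand-in for "seconds to minutes". Privacy and bandwidth cost are deliberately left out to keep it short.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    DEVICE = "device edge"
    GATEWAY = "gateway edge"
    CLOUD = "cloud"


DEVICE_LATENCY_MS = 10      # from the device-edge criteria above
DEVICE_MODEL_LIMIT_MB = 10  # from the gateway-edge criteria above
LAX_LATENCY_MS = 1000       # assumption: "seconds to minutes" starts at 1 s


@dataclass
class Requirements:
    max_latency_ms: float
    model_size_mb: float
    reliable_connectivity: bool
    needs_fleet_wide_data: bool


def recommend_placement(req: Requirements) -> Placement:
    """First-pass suggestion only; real projects weigh more factors."""
    needs_local = (
        req.max_latency_ms < DEVICE_LATENCY_MS or not req.reliable_connectivity
    )
    if needs_local:
        # Stay as close to the sensor as the model size allows.
        if req.model_size_mb <= DEVICE_MODEL_LIMIT_MB:
            return Placement.DEVICE
        return Placement.GATEWAY
    if req.needs_fleet_wide_data or req.max_latency_ms >= LAX_LATENCY_MS:
        return Placement.CLOUD
    return Placement.GATEWAY


# Hypothetical workloads.
keyword_spotter = Requirements(5, 0.3, True, False)
fleet_analytics = Requirements(60_000, 2_000, True, True)
site_vision = Requirements(200, 50, False, False)

for req in (keyword_spotter, fleet_analytics, site_vision):
    print(recommend_placement(req).value)
```

For these three hypothetical workloads the function suggests device edge, cloud and gateway edge respectively. When a workload pulls in two directions at once, that is usually the signal to consider splitting it across levels.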
The hybrid approach (often the right answer)
In production deployments, we frequently use all three levels:
- Device: detect and pre-classify events (low latency, no bandwidth used until an event occurs)
- Gateway: aggregate, correlate, run mid-size models
- Cloud: train models, run heavy inference, store history
This hybrid architecture is more complex to design and operate, but in our experience it usually offers the best balance of cost, latency and reliability.
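The division of labour between the three levels can be sketched in a few lines. This is a toy illustration under stated assumptions: the threshold detector stands in for an on-device model, `CloudStore` is an in-memory placeholder for real cloud storage, and the device names and sample values are invented.

```python
from statistics import mean


def device_detect(samples, threshold):
    """Device level: forward only samples whose magnitude exceeds the threshold."""
    return [s for s in samples if abs(s) > threshold]


def gateway_aggregate(events_by_device):
    """Gateway level: reduce each device's events to a compact summary."""
    return {
        device_id: {"count": len(events), "mean_abs": mean(abs(e) for e in events)}
        for device_id, events in events_by_device.items()
        if events  # devices with no events send nothing upstream
    }


class CloudStore:
    """Cloud level stand-in: keeps history in memory for this example."""

    def __init__(self):
        self.history = []

    def ingest(self, summary):
        self.history.append(summary)


# Hypothetical vibration samples from two motors.
raw = {
    "motor-a": [0.1, 0.2, 2.5, -3.5],
    "motor-b": [0.0, 0.1, 0.05],
}

events = {dev: device_detect(samples, threshold=1.0) for dev, samples in raw.items()}
summary = gateway_aggregate(events)

cloud = CloudStore()
cloud.ingest(summary)

raw_count = sum(len(s) for s in raw.values())
sent_count = sum(len(e) for e in events.values())
print(summary)
print(f"samples captured: {raw_count}, events forwarded: {sent_count}")
```

Of the seven samples captured here, only two leave the devices, and the cloud receives a single summary record for motor-a. That reduction at each hop is where the bandwidth and cost savings of the hybrid layout come from.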
Designing your AI architecture?
The right placement of inference is often the difference between a successful product and a costly mess. Let's discuss your use case.