Edge AI vs Cloud AI: choosing the right architecture

Five years ago, almost all AI inference happened in the cloud. Today, a quantised TensorFlow Lite model can run on a $2 microcontroller. So where should your AI actually live? Here's our decision framework.

The four architectures

Inference can happen at three levels in a connected system, or be split across them. Each option has different trade-offs:

Device edge: directly on the sensor/MCU (Cortex-M, ESP32, dedicated NPU)
Gateway edge: on a local hub (Raspberry Pi-class, NVIDIA Jetson, x86 industrial PC)
Cloud: remote servers (AWS, Azure, GCP)
Hybrid: distributed across multiple levels

When to use device edge

Run inference on the device itself when:

Latency is critical: safety-critical decisions in under 10 ms (industrial vision, motor control)
Connectivity is unreliable: remote sites, mobile assets, intermittent coverage
Privacy is paramount: data never leaves the device (medical, biometric)
Bandwidth is expensive: cellular IoT where every byte counts

Best for: keyword spotting, anomaly detection in vibration data, image classification with small models, simple gesture recognition.
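
To put the bandwidth criterion in numbers, here is a back-of-the-envelope sketch in Python. The sensor parameters (a 3-axis accelerometer sampled at 1 kHz with 16-bit samples) and the event figures (20 events a day, 64 bytes each) are illustrative assumptions, not measurements from a real deployment.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def raw_bytes_per_day(sample_rate_hz: int, bytes_per_sample: int, channels: int) -> int:
    """Bytes uploaded per day if every raw sample is streamed to the cloud."""
    return sample_rate_hz * bytes_per_sample * channels * SECONDS_PER_DAY

def event_bytes_per_day(events_per_day: int, bytes_per_event: int) -> int:
    """Bytes uploaded per day if the device only reports detected events."""
    return events_per_day * bytes_per_event

# Assumed sensor: 3-axis accelerometer, 1 kHz, 16-bit (2-byte) samples.
raw = raw_bytes_per_day(sample_rate_hz=1000, bytes_per_sample=2, channels=3)
# Assumed on-device detector: 20 events a day, 64 bytes per event message.
events = event_bytes_per_day(events_per_day=20, bytes_per_event=64)

print(f"raw stream:  {raw / 1e6:.1f} MB/day")
print(f"events only: {events / 1e3:.2f} kB/day")
```

Under these assumptions the raw stream is several hundred megabytes a day while the event stream is about a kilobyte, which is why on-device detection tends to pay off on metered cellular links.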

When to use gateway edge

Push inference to a local gateway when:

Models are too large for a microcontroller (>10 MB)
Multiple sensors need to be correlated locally
You need a local user interface or local data dashboards
Bandwidth to cloud is limited but local network is fine

Best for: smart buildings (correlate dozens of sensors), industrial vision pipelines, multi-camera analytics, edge servers for retail/restaurant chains.
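
The model-size criterion can be checked with simple arithmetic: weight storage is roughly the parameter count times the bytes per weight. The sketch below uses the 10 MB rule of thumb from the list above; the function names and the 5-million-parameter example are hypothetical, and the estimate ignores activation memory and runtime overhead, so treat it as a first filter rather than a guarantee that a model will run on a given part.

```python
MCU_MODEL_LIMIT_BYTES = 10 * 1000 * 1000  # the ">10 MB" rule of thumb

# Storage cost per weight for common numeric formats.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}

def model_size_bytes(num_params: int, dtype: str) -> int:
    """Approximate weight storage: parameter count times bytes per weight."""
    return num_params * BYTES_PER_WEIGHT[dtype]

def fits_on_mcu(num_params: int, dtype: str, limit: int = MCU_MODEL_LIMIT_BYTES) -> bool:
    """True if the weights alone fit within the size limit."""
    return model_size_bytes(num_params, dtype) <= limit

# Hypothetical model with 5 million parameters.
for dtype in ("float32", "int8"):
    size = model_size_bytes(5_000_000, dtype)
    print(f"{dtype}: {size / 1e6:.0f} MB, fits on MCU: {fits_on_mcu(5_000_000, dtype)}")
```

In this example, quantising from float32 to int8 cuts the weights to a quarter of their size, which moves the model from gateway territory to under the limit.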

When to use cloud AI

Keep inference in the cloud when:

Models are very large (LLMs, large vision models)
You need access to data from many devices simultaneously
Latency requirements are lax (seconds to minutes)
You want to evolve models frequently without firmware updates
You need compute that scales dynamically with demand

Best for: fleet-level analytics, generative AI (LLM-powered chatbots, agents), historical pattern recognition, advanced predictive maintenance.
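
The criteria from the three sections above can be condensed into a small decision function. This is a minimal sketch, not a definitive rule set: the Workload fields, the ordering of the checks and the 1000 MB gateway limit are assumptions made for illustration, while the 10 ms and 10 MB thresholds come from the lists above.

```python
from dataclasses import dataclass

DEVICE_LATENCY_MS = 10         # latency threshold from the device edge list
MCU_MODEL_LIMIT_MB = 10        # model size threshold from the gateway edge list
GATEWAY_MODEL_LIMIT_MB = 1000  # assumption: beyond this, treat the model as cloud-only

@dataclass
class Workload:
    max_latency_ms: float
    model_size_mb: float
    needs_offline: bool = False
    data_must_stay_local: bool = False
    correlates_local_sensors: bool = False
    needs_fleet_data: bool = False

def choose_placement(w: Workload) -> str:
    """Return 'device', 'gateway', 'cloud' or 'hybrid' for a workload."""
    needs_local = (
        w.max_latency_ms < DEVICE_LATENCY_MS
        or w.needs_offline
        or w.data_must_stay_local
    )
    needs_cloud = w.needs_fleet_data or w.model_size_mb > GATEWAY_MODEL_LIMIT_MB
    if needs_local and needs_cloud:
        return "hybrid"   # conflicting requirements: split the work across levels
    if needs_cloud:
        return "cloud"
    if needs_local and not w.correlates_local_sensors and w.model_size_mb <= MCU_MODEL_LIMIT_MB:
        return "device"
    if needs_local or w.correlates_local_sensors:
        return "gateway"
    return "cloud"        # lax latency and no local constraints

keyword_spotting = Workload(max_latency_ms=5, model_size_mb=0.5)
multi_camera = Workload(max_latency_ms=200, model_size_mb=80, correlates_local_sensors=True)
fleet_analytics = Workload(max_latency_ms=60_000, model_size_mb=50, needs_fleet_data=True)

for name, w in [("keyword spotting", keyword_spotting),
                ("multi-camera analytics", multi_camera),
                ("fleet analytics", fleet_analytics)]:
    print(f"{name}: {choose_placement(w)}")
```

A real project would weigh more factors (power budget, unit cost, update cadence), but encoding even a rough version makes the trade-offs explicit and easy to review.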

The hybrid approach (often the right answer)

In production deployments, we frequently use all three levels:

Device: detect and pre-classify events (low latency, no bandwidth used)
Gateway: aggregate, correlate, run mid-size models
Cloud: train models, run heavy inference, store history

This hybrid architecture is more complex to design and operate, but in our experience it usually comes out ahead on cost, latency and reliability.
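
The three-stage split can be sketched as a simple cascade. Everything here is hypothetical: the anomaly scores stand in for the output of an on-device model, and the 0.8 threshold, the two-sensor agreement rule and the in-memory list that plays the role of cloud storage are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Event:
    sensor_id: str
    score: float

def device_stage(sensor_id: str, score: float, threshold: float = 0.8) -> Optional[Event]:
    """Device: drop normal readings, forward only suspicious ones."""
    return Event(sensor_id, score) if score >= threshold else None

def gateway_stage(events: List[Event], min_sensors: int = 2) -> bool:
    """Gateway: escalate only when several distinct sensors agree."""
    return len({e.sensor_id for e in events}) >= min_sensors

cloud_history: List[Event] = []  # stand-in for cloud-side storage

def cloud_stage(events: List[Event]) -> None:
    """Cloud: keep escalated events for history and retraining."""
    cloud_history.extend(events)

# Hypothetical (sensor, anomaly score) readings produced on the devices.
readings = [("pump-1", 0.91), ("pump-2", 0.35), ("pump-3", 0.87), ("pump-1", 0.20)]

events: List[Event] = []
for sensor_id, score in readings:
    event = device_stage(sensor_id, score)
    if event is not None:
        events.append(event)

if gateway_stage(events):
    cloud_stage(events)

print(f"{len(readings)} readings, {len(events)} events forwarded, "
      f"{len(cloud_history)} stored in the cloud")
```

Each stage filters what the next one sees, so traffic and compute shrink as data moves upward while the cloud still receives the events worth keeping.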

Designing your AI architecture?

The right placement of inference is often the difference between a successful product and a costly mess. Let's discuss your use case.
