Optimizing generative AI models for edge deployment: Techniques and best practices

Pentaparthi, Sai Kalyan Reddy (2025) Optimizing generative AI models for edge deployment: Techniques and best practices. World Journal of Advanced Research and Reviews, 26 (1). pp. 1485-1492. ISSN 2581-9615

Article PDF: WJARR-2025-1161.pdf (Published Version, 494 kB)
Available under a Creative Commons Attribution-NonCommercial-ShareAlike license.

Abstract

Generative AI models represent a significant advancement in content creation capabilities but face substantial challenges when deployed at the network edge due to inherent resource constraints. This article examines comprehensive optimization strategies for enabling generative AI functionality on edge devices without requiring cloud connectivity. The exponential growth in model size has created a widening gap between computational requirements and the limited resources available in edge environments. Through systematic model compression, architectural redesign, and hardware-software co-optimization, generative models can achieve dramatic efficiency improvements while maintaining acceptable quality thresholds. The compression techniques examined include pruning methodologies that systematically eliminate redundant parameters, quantization approaches that reduce numerical precision, and knowledge distillation methods that transfer capabilities from larger models to compact alternatives. Architectural innovations such as modified attention mechanisms, conditional computation, and neural architecture search further enhance efficiency by fundamentally rethinking model design for resource-constrained environments. The integration of these techniques with hardware-specific optimizations and specialized software frameworks enables practical deployment across diverse application domains. Real-world implementations in speech processing, computer vision, and industrial IoT demonstrate that properly optimized generative models can operate within edge constraints while delivering near-real-time performance and maintaining high-quality outputs. These advancements empower industries to leverage generative AI capabilities in scenarios where privacy concerns, connectivity limitations, or latency requirements make cloud-based processing impractical.
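As a concrete illustration of two of the compression techniques named above, the sketch below shows magnitude-based pruning (zeroing low-magnitude parameters) and symmetric per-tensor int8 quantization (reducing numerical precision). This is a minimal illustrative example, not code from the paper; the function names, the 50% sparsity target, and the int8 range are all assumptions.

```python
# Illustrative sketch (not from the paper) of two compression techniques
# the abstract describes: magnitude pruning and symmetric int8 quantization.
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def quantize_int8(weights):
    """Symmetric per-tensor quantization of float weights to int8."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and a scale."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 4)).astype(np.float32)
    pruned = magnitude_prune(w, sparsity=0.5)
    q, scale = quantize_int8(pruned)
    # Roughly half the entries are now zero, and storage drops from
    # 32-bit floats to 8-bit ints plus a single scale factor.
    print(np.mean(pruned == 0))                           # ~0.5
    print(np.max(np.abs(dequantize(q, scale) - pruned)))  # bounded by scale/2
```

Pruned-and-quantized weights like these are what edge runtimes then map onto hardware-specific kernels; real deployments would additionally fine-tune or calibrate to recover accuracy lost in compression.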

Item Type: Article
Official URL: https://doi.org/10.30574/wjarr.2025.26.1.1161
Uncontrolled Keywords: Generative AI; Edge Computing; Model Compression; Quantization; Neural Architecture Search; Hardware Acceleration
Depositing User: Editor WJARR
Date Deposited: 25 Jul 2025 14:33
URI: https://eprint.scholarsrepository.com/id/eprint/1826