Le Zhang, Onat Gungor, Flavio Ponzina, Tajana Rosing
Asia and South Pacific Design Automation Conference (ASPDAC) 2025
Ensemble learning is a meta-learning approach that combines the predictions of multiple learners, demonstrating improved accuracy and robustness. Nevertheless, ensembling models such as Convolutional Neural Networks (CNNs) results in high memory and computing overheads, preventing deployment in embedded systems. These devices are usually powered by small batteries and may include energy-harvesting modules that extract energy from the environment. In this work, we propose E-QUARTIC, a novel Energy-Efficient Edge Ensembling framework that builds ensembles of CNNs targeting Artificial Intelligence (AI)-based embedded systems. Our design outperforms single-instance CNN baselines and state-of-the-art edge AI solutions, improving accuracy and adapting to varying energy conditions while maintaining similar memory requirements. We then leverage the multi-CNN structure of the designed ensemble to implement an energy-aware model selection policy for energy-harvesting AI systems. We show that our solution outperforms the state-of-the-art, reducing the system failure rate by up to 40% while ensuring higher average output quality. Finally, we show that the proposed design enables concurrent on-device training and high-quality inference execution at the edge, limiting the performance and energy overheads to less than 0.04%.
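To make the energy-aware selection idea concrete, here is a minimal C sketch of one way such a policy could work: run as many ensemble members as the current energy budget allows, then take a majority vote. This is our own illustration, not E-QUARTIC's actual policy; all names, sizes, and per-member cost figures are assumptions.

```c
/* Hypothetical sketch of an energy-aware ensemble selection policy in the
 * spirit of E-QUARTIC. All names and cost figures are illustrative, not
 * taken from the paper. */
#include <stdio.h>

#define NUM_MEMBERS 4
#define NUM_CLASSES 10

/* Assumed per-member inference cost in microjoules (illustrative values). */
static const float member_cost_uj[NUM_MEMBERS] = {120.0f, 120.0f, 120.0f, 120.0f};

/* Placeholder for one member's inference; returns a class label. */
static int run_member(int member_id, const float *input) {
    (void)member_id; (void)input;
    return 0; /* stub */
}

/* Run members until the harvested-energy budget is exhausted, then vote. */
int ensemble_predict(const float *input, float energy_budget_uj) {
    int votes[NUM_CLASSES] = {0};
    int ran = 0;
    for (int m = 0; m < NUM_MEMBERS; m++) {
        if (energy_budget_uj < member_cost_uj[m]) break;
        energy_budget_uj -= member_cost_uj[m];
        votes[run_member(m, input)]++;
        ran++;
    }
    if (ran == 0) return -1; /* not enough energy for even one member */
    int best = 0;
    for (int c = 1; c < NUM_CLASSES; c++)
        if (votes[c] > votes[best]) best = c;
    return best;
}
```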
Yubo Luo, Le Zhang, Zhenyu Wang, Shahriar Nirjon
International Conference on Embedded Wireless Systems and Networks (EWSN) 2024
We present Antler, which exploits the affinity between all pairs of tasks in a multitask inference system to construct a compact graph representation of the task set and to find an optimal order of execution of the tasks such that the end-to-end time and energy cost of inference are reduced while accuracy remains similar to the state-of-the-art. The design of Antler is based on two observations: first, tasks running on the same platform show affinity, which we leverage to find a compact graph representation of the tasks that avoids unnecessary computation of overlapping subtasks; and second, tasks that run on the same system may have dependencies, which we leverage to find an ordering of the tasks that avoids unnecessary computation of dependent tasks or the remaining portion of a task. We implement two systems: a 16-bit TI MSP430FR5994-based custom-designed ultra-low-power system, and a 32-bit ARM Cortex M4/M7-based off-the-shelf STM32H747 board. We conduct both dataset-driven experiments and real-world deployments with these systems. We observe that Antler's execution time and energy consumption are the lowest among all baseline systems: by leveraging the similarity of tasks and reusing intermediate results from previous tasks, Antler reduces inference time by 2.3X-4.6X and saves 56%-78% energy compared to the state-of-the-art.
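As an illustration of the prefix-reuse idea (not Antler's actual algorithm), the C sketch below encodes each task as a chain of block IDs and greedily orders tasks so that consecutive tasks share the longest prefix of blocks, letting cached intermediate results be reused. All task and block definitions are hypothetical.

```c
/* Simplified sketch of affinity-driven task ordering: tasks that share a
 * prefix of network blocks can reuse cached intermediate results when run
 * back to back. The encoding and greedy ordering are our own illustration. */
#include <stdio.h>

#define NUM_TASKS 3
#define MAX_DEPTH 4

/* Each task is a chain of block IDs; equal IDs mean shared computation. */
static const int task_blocks[NUM_TASKS][MAX_DEPTH] = {
    {1, 2, 3, 4},   /* task 0 */
    {1, 2, 5, 6},   /* task 1: shares blocks 1-2 with task 0 */
    {1, 7, 8, 9},   /* task 2: shares only block 1 */
};

/* Length of the shared prefix between two tasks' block chains. */
static int shared_prefix(int a, int b) {
    int n = 0;
    while (n < MAX_DEPTH && task_blocks[a][n] == task_blocks[b][n]) n++;
    return n;
}

int main(void) {
    int visited[NUM_TASKS] = {0};
    int cur = 0;                    /* start with task 0 */
    visited[0] = 1;
    printf("run task %d (no reuse)\n", cur);
    /* Greedy: next, run the unvisited task sharing the longest block
     * prefix with the task just executed, skipping the shared blocks. */
    for (int i = 1; i < NUM_TASKS; i++) {
        int best = -1, best_len = -1;
        for (int t = 0; t < NUM_TASKS; t++) {
            if (!visited[t]) {
                int len = shared_prefix(cur, t);
                if (len > best_len) { best = t; best_len = len; }
            }
        }
        visited[best] = 1;
        cur = best;
        printf("run task %d, reusing %d cached block(s)\n", best, best_len);
    }
    return 0;
}
```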
Run Wang, Shirley Bian, Xiaofan Yu, Quanling Zhao, Le Zhang, Tajana Rosing
ACM Conference on Embedded Networked Sensor Systems (SenSys) 2024
On-device environmental sound classification (ESC) in rural areas faces a major challenge: resource efficiency. Traditional methods rely on resource-intensive machine learning models, making them impractical for small edge devices such as microcontrollers (MCUs). This poster presents SoundHD, a novel ESC solution based on Hyperdimensional Computing (HDC), a brain-inspired, lightweight computing paradigm. We further optimize the memory footprint for deployment on MCUs. Our initial results show that SoundHD can be deployed and executed effectively on memory-constrained MCUs.
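For intuition on why HDC suits MCUs, here is a minimal C sketch of the classification step common to HDC systems: binary hypervectors compared by Hamming distance, computed with XOR and popcount. The dimension, class count, and encoding details are illustrative assumptions, not SoundHD's actual configuration.

```c
/* Minimal sketch of HDC-style classification: nearest class prototype by
 * Hamming distance over binary hypervectors. All sizes are illustrative. */
#include <stdint.h>

#define HD_WORDS 32     /* 32 x 32 = 1024-bit hypervectors (illustrative) */
#define NUM_CLASSES 5

typedef struct { uint32_t w[HD_WORDS]; } hv_t;

/* Hamming distance between two hypervectors via XOR and popcount. */
static int hv_distance(const hv_t *a, const hv_t *b) {
    int d = 0;
    for (int i = 0; i < HD_WORDS; i++)
        d += __builtin_popcount(a->w[i] ^ b->w[i]);
    return d;
}

/* Classify a query hypervector against per-class prototype hypervectors. */
int hv_classify(const hv_t *query, const hv_t proto[NUM_CLASSES]) {
    int best = 0, best_d = hv_distance(query, &proto[0]);
    for (int c = 1; c < NUM_CLASSES; c++) {
        int d = hv_distance(query, &proto[c]);
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}
```

Because inference reduces to bitwise operations over a handful of machine words, both the compute and memory footprint stay far below those of a comparable neural network, which is what makes the approach attractive on MCUs.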
Le Zhang, Yubo Luo, Shahriar Nirjon
ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN) 2022
Resource-optimized deep neural networks (DNNs) nowadays run on microcontrollers to perform a wide variety of audio, image, and sensor data classification tasks. Despite comprehensive deep learning tool support for 32-bit microcontrollers, performing deep learning inference on 16-bit microcontrollers remains a challenge. Although some tools exist for implementing neural networks on 16-bit systems, there is generally a large gap in efficiency between the development tools for 16-bit microcontrollers and those for 32-bit (or higher) systems. There is also a steep learning curve that discourages beginners inexperienced with microcontrollers and C programming from developing efficient and effective deep learning models for 16-bit microcontrollers. To fill this gap, we have created a neural network model generator that (1) automatically transfers the parameters of a pre-trained DNN or CNN model from commonly used frameworks to a 16-bit microcontroller, and (2) automatically implements the model on the microcontroller to perform on-device inference. The automated parameter transfer saves time and minimizes the chance of error, and the automatic implementation reduces the complexity of implementing DNNs and CNNs on ultra-low-power microcontrollers.
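As a hypothetical example of the kind of code such a generator might emit for a 16-bit target, the C sketch below shows a fully connected layer with model parameters exported as fixed-point (Q15) arrays. The layer sizes, weight values, and the Q15 format are our own assumptions for illustration, not the tool's actual output.

```c
/* Illustrative generated code for a 16-bit MCU: a fully connected layer
 * whose weights and biases a generator would fill in from the pre-trained
 * model. Q15 fixed point (1 sign bit, 15 fractional bits) is an assumed
 * format; saturation handling is omitted for brevity. */
#include <stdint.h>

#define IN_DIM  4
#define OUT_DIM 2

/* Parameter arrays the generator would populate from the trained model. */
static const int16_t fc_weight[OUT_DIM][IN_DIM] = {
    { 16384, -8192,  4096,     0},   /* 0.5, -0.25, 0.125, 0.0 in Q15 */
    { -4096,  8192, 16384,  8192},
};
static const int16_t fc_bias[OUT_DIM] = { 1024, -1024 };

/* Fully connected layer: 32-bit accumulation, then shift back to Q15. */
void fc_forward(const int16_t in[IN_DIM], int16_t out[OUT_DIM]) {
    for (int o = 0; o < OUT_DIM; o++) {
        int32_t acc = (int32_t)fc_bias[o] << 15;
        for (int i = 0; i < IN_DIM; i++)
            acc += (int32_t)fc_weight[o][i] * in[i];
        out[o] = (int16_t)(acc >> 15);   /* requantize to Q15 */
    }
}
```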