My research broadly focuses on bridging embedded and mobile systems with emerging technologies such as artificial intelligence, emphasizing system efficiency, sustainability, and user accessibility. Mobile and embedded AI applications often face tight constraints on computing resources alongside demands for long-term sustainable operation. I aim to develop innovative solutions that address real-world challenges while optimizing resource consumption and enhancing usability across diverse applications.
Meanwhile, I am also a lifelong HCI (a.k.a. Human-Cat Interaction) researcher. My research primarily focuses on efficiently communicating and interacting with Miss Hope, an adopted American Shorthair raised in the Midwest. My current projects, How to Prevent Your Cat from Tampering with Jumper Wires and How to Evict Your Feline Overlord from Your Keyboard and Desk, are graciously funded by our first electric officer and wiring manager. They ensure that my circuits (and sanity) remain intact while Miss Hope continues to pose new challenges in the ever-evolving field of HCI.
") does not match the recommended repository name for your site ("
").
", so that your site can be accessed directly at "http://
".
However, if the current repository name is intended, you can ignore this message by removing "{% include widgets/debug_repo_name.html %}
" in index.html
.
",
which does not match the baseurl
("
") configured in _config.yml
.
baseurl
in _config.yml
to "
".
Le Zhang, Onat Gungor, Flavio Ponzina, Tajana Rosing
Asia and South Pacific Design Automation Conference (ASPDAC) 2025
Ensemble learning is a meta-learning approach that combines the predictions of multiple learners, demonstrating improved accuracy and robustness. Nevertheless, ensembling models like Convolutional Neural Networks (CNNs) results in high memory and computing overhead, preventing their deployment in embedded systems. These devices are usually equipped with small batteries for power supply and might include energy-harvesting modules that extract energy from the environment. In this work, we propose E-QUARTIC, a novel Energy-Efficient Edge Ensembling framework that builds ensembles of CNNs targeting Artificial Intelligence (AI)-based embedded systems. Our design outperforms single-instance CNN baselines and state-of-the-art edge AI solutions, improving accuracy and adapting to varying energy conditions while maintaining similar memory requirements. We then leverage the multi-CNN structure of the designed ensemble to implement an energy-aware model selection policy for energy-harvesting AI systems. We show that our solution outperforms the state-of-the-art, reducing the system failure rate by up to 40% while ensuring higher average output quality. Finally, we show that the proposed design enables concurrent on-device training and high-quality inference execution at the edge, limiting the performance and energy overheads to less than 0.04%.
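As a rough illustration of the energy-aware model selection idea described in the abstract, the Python sketch below greedily picks ensemble members that fit the currently available energy budget, so output quality degrades gracefully as harvested energy drops. The member list, energy costs, weights, and greedy policy are illustrative assumptions, not the actual E-QUARTIC implementation.

```python
# Hypothetical sketch of an energy-aware ensemble selection policy.
# All model costs/weights below are illustrative assumptions, not
# values from the E-QUARTIC paper.

from dataclasses import dataclass

@dataclass
class WeakCNN:
    name: str
    energy_cost_mj: float   # energy for one inference (millijoules)
    weight: float           # contribution to the weighted ensemble vote

ENSEMBLE = [
    WeakCNN("cnn_small",  0.8, 0.2),
    WeakCNN("cnn_medium", 1.5, 0.3),
    WeakCNN("cnn_large",  2.6, 0.5),
]

def select_members(budget_mj: float) -> list[WeakCNN]:
    """Greedily pick the highest-weight members that fit the budget."""
    chosen, remaining = [], budget_mj
    for m in sorted(ENSEMBLE, key=lambda m: m.weight, reverse=True):
        if m.energy_cost_mj <= remaining:
            chosen.append(m)
            remaining -= m.energy_cost_mj
    return chosen

# Example: with 3.0 mJ available, the policy runs the two strongest
# members instead of failing outright or always running all three.
print([m.name for m in select_members(3.0)])
```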
Yubo Luo, Le Zhang, Zhenyu Wang, Shahriar Nirjon
International Conference on Embedded Wireless Systems and Networks (EWSN) 2024
We present Antler, which exploits the affinity between all pairs of tasks in a multitask inference system to construct a compact graph representation of the task set and find an optimal order of execution of the tasks such that the end-to-end time and energy cost of inference is reduced while the accuracy remains similar to the state-of-the-art. The design of Antler is based on two observations: first, tasks running on the same platform show affinity, which is leveraged to find a compact graph representation of the tasks that helps avoid unnecessary computations of overlapping subtasks in the task set; and second, tasks that run on the same system may have dependencies, which is leveraged to find an optimal ordering of the tasks that helps avoid unnecessary computations of the dependent tasks or the remaining portion of a task. We implement two systems: a 16-bit TI MSP430FR5994-based custom-designed ultra-low-power system, and a 32-bit ARM Cortex M4/M7-based off-the-shelf STM32H747 board. We conduct both dataset-driven experiments and real-world deployments with these systems. We observe that Antler's execution time and energy consumption are the lowest among all baseline systems; by leveraging the similarity of tasks and reusing the intermediate results from previous tasks, Antler reduces inference time by 2.3x-4.6x and saves 56%-78% energy compared to the state-of-the-art.
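To make the reuse idea concrete, here is a small Python sketch: tasks that share a prefix of network blocks can reuse cached intermediate results, so an ordering that places tasks with long shared prefixes next to each other executes fewer blocks overall. The block IDs, task set, and exhaustive ordering search are illustrative assumptions; Antler's actual graph construction and ordering algorithm operate on its compact task graph and scale beyond a handful of tasks.

```python
# Illustrative sketch of prefix reuse in multitask inference.
# Task definitions and the brute-force search are assumptions for
# illustration only, not Antler's actual algorithm.

from itertools import permutations

# Each task is a sequence of network block IDs executed in order.
TASKS = {
    "keyword":  ("conv1", "conv2", "fc_kws"),
    "speaker":  ("conv1", "conv2", "fc_spk"),
    "activity": ("conv1", "fc_act"),
}

def shared_prefix(a, b) -> int:
    """Number of leading blocks two tasks have in common (reusable work)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def execution_cost(order) -> int:
    """Blocks actually executed if each task reuses the cached prefix
    it shares with the task that ran immediately before it."""
    cost, prev = 0, ()
    for name in order:
        blocks = TASKS[name]
        cost += len(blocks) - shared_prefix(prev, blocks)
        prev = blocks
    return cost

# Exhaustive search is fine for three tasks; it puts "keyword" and
# "speaker" adjacent so their shared conv1->conv2 prefix runs once.
best = min(permutations(TASKS), key=execution_cost)
print(best, execution_cost(best))
```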
Le Zhang, Yubo Luo, Shahriar Nirjon
ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN) 2022
Resource-optimized deep neural networks (DNNs) nowadays run on microcontrollers to perform a wide variety of audio, image and sensor data classification tasks. Despite comprehensive support for deep learning tools for 32-bit microcontrollers, performing deep learning inferences on 16-bit microcontrollers still remains a challenge. Although there are some tools for implementing neural networks on 16-bit systems, generally, there is a large gap in efficiency between the development tools for 16-bit microcontrollers and 32-bit (or higher) systems. There is also a steep learning curve that discourages beginners inexperienced with microcontrollers and programming in C from developing efficient and effective deep learning models for 16-bit microcontrollers. To fill this gap, we have created a neural network model generator that (1) automatically transfers parameters of a pre-trained DNN or CNN model from commonly used frameworks to a 16-bit microcontroller, and (2) automatically implements the model on the microcontroller to perform on-device inference. The optimization of data transfer saves time and minimizes chances of error, and the automatic implementation reduces the complexity of implementing DNNs and CNNs on ultra-low-power microcontrollers.
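As a hedged sketch of the parameter-transfer step, the Python snippet below shows one way weights from a trained layer might be quantized to 16-bit fixed point and emitted as a C array that a 16-bit microcontroller can compile in. The Q15 format, helper names, and output layout are assumptions for illustration, not the generator's actual output format.

```python
# Minimal sketch: flatten a trained layer's float weights into a Q15
# fixed-point C array for a 16-bit MCU. The quantization scheme and
# naming convention are illustrative assumptions.

import numpy as np

def to_q15(weights: np.ndarray) -> np.ndarray:
    """Quantize float weights in [-1, 1) to Q15 fixed point (int16)."""
    return np.clip(np.round(weights * 32768), -32768, 32767).astype(np.int16)

def emit_c_array(name: str, weights: np.ndarray) -> str:
    """Emit a C definition for the quantized, flattened weight tensor."""
    q = to_q15(weights.flatten())
    body = ",\n    ".join(
        ", ".join(str(v) for v in q[i:i + 8]) for i in range(0, len(q), 8)
    )
    return (
        f"// Auto-generated weights for layer '{name}' (Q15 fixed point)\n"
        f"const int16_t {name}_weights[{len(q)}] = {{\n    {body}\n}};\n"
    )

# Example: a small fully connected layer's weight matrix.
rng = np.random.default_rng(0)
print(emit_c_array("fc1", rng.uniform(-1, 1, size=(4, 8))))
```

Emitting weights as `const` arrays lets the MCU toolchain place them in flash rather than scarce RAM, which is one reason code generation is a common deployment path on such devices.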