Tiny Machine Learning Virtualization for IoT and Edge Computing using the REXA VM

Towards Learning Technical Systems

Stefan Bosse$^{1,2}$

Christoph Polle$^3$

$^1$University of Bremen, Dept. Mathematics & Computer Science, Bremen, Germany

$^2$University of Siegen, Dept. Mechanical Engineering, Siegen, Germany

$^3$Faserinstitut Bremen (FIBRE), Bremen, Germany
Overview

Tiny Machine Learning is a new approach that is being used for data-driven prediction, classification, and regression on microcontrollers using local sensor data.

But even simple sensor data acquisition, aggregation, and processing is a challenge in distributed sensor network environments, the IoT, mobile networks, and other distributed strongly heterogeneous networks.

The goal is to process sensor data locally and derive compressed, relevant features (e.g., damage, attacks, ...), followed by a final global feature fusion.

To overcome the issues and limitations of software and ML deployment in strongly heterogeneous computer networks, the real-time capable, low-resource virtual machine REXA VM is introduced.

REXA VM provides virtualization of basic ML operations and models, including but not limited to decision trees, ANNs, and CNNs.

REXA VM and its ML operations can be deployed on low-resource microcontrollers such as the STM32 ARM Cortex-M series, starting with only 20 kB of RAM and 32 kB of ROM!
Introduction

Fig. 1. Let's start here: A material-integrable sensor node for damage diagnostics in Fibre-Metal Laminates using Guided Ultrasonic Waves (STM32 ARM Cortex M0, RFID, ADC) [IMSAS Bremen, B. Lüssem et al., 2023]
Host Platforms and Efficiency

Efficiency of data processing is always an important objective to optimize, especially for material-integrated sensor networks. The efficiency of data processing systems can be compared by the following normalized performance factor $\epsilon$:

$$\epsilon = \frac{C \cdot M}{A \cdot P}$$

- $C$: Computational power of the data processing system in million instructions per second (MIPS)
- $M$: Memory capacity (RAM/ROM) in kB
- $A$: Entire chip area in $mm^2$
- $P$: Electrical power consumption in mW.

<table>
<thead>
<tr>
<th>Device</th>
<th>Chip Area</th>
<th>Clock/MIPS</th>
<th>Power</th>
<th>RAM/ROM</th>
<th>ε</th>
</tr>
</thead>
<tbody>
<tr>
<td>Atmel Tiny 20</td>
<td>2.1 mm^2</td>
<td>12 MHz</td>
<td>4 mW</td>
<td>0.1 kB/2 kB</td>
<td>3</td>
</tr>
<tr>
<td>ARM Cortex M0 (Smart Dust 2002)</td>
<td>0.1 mm^2</td>
<td>740 kHz</td>
<td>70 mW</td>
<td>4 kB/4 kB</td>
<td>0.84</td>
</tr>
<tr>
<td>Freescale KL03 (ARM Cortex M0+)</td>
<td>4 mm^2</td>
<td>48 MHz</td>
<td>3 mW</td>
<td>2 kB/40 kB</td>
<td>168</td>
</tr>
<tr>
<td>STM32 F103VC M3</td>
<td>~10 mm^2</td>
<td>72 MHz</td>
<td>200 mW</td>
<td>48 kB/256 kB</td>
<td>11</td>
</tr>
<tr>
<td>STM32 F103C8 M3</td>
<td>~6 mm^2 (meas.)</td>
<td>48 MHz</td>
<td>100 mW</td>
<td>20 kB/64 kB</td>
<td>6.7</td>
</tr>
<tr>
<td>STM32 L031G6U6 M0+</td>
<td>0.25 mm^2 (meas.)</td>
<td>16 MHz</td>
<td>2 mW</td>
<td>8 kB/32 kB</td>
<td>1280</td>
</tr>
<tr>
<td>STM32 L073CZU6 M0+</td>
<td>~1 mm^2</td>
<td>16/32 MHz</td>
<td>5/12 mW</td>
<td>20 kB/192 kB</td>
<td>678/565</td>
</tr>
<tr>
<td>Xilinx Spartan 3-500E</td>
<td>9.6 mm^2 (meas.)</td>
<td>50 MHz</td>
<td>100 mW</td>
<td>45 kB</td>
<td>2.34</td>
</tr>
<tr>
<td>Xilinx Spartan 7-S25</td>
<td>~50 mm^2</td>
<td>100 MHz</td>
<td>100 mW</td>
<td>202 kB</td>
<td>4</td>
</tr>
</tbody>
</table>
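
The ε column can be reproduced from the formula with a few lines of arithmetic. The following is a minimal sketch, assuming that $C$ is approximated by the clock frequency in MHz (roughly one instruction per cycle) and that $M$ is the sum of RAM and ROM in kB. Both assumptions are inferred from the table values, they are not stated explicitly, and the function name `efficiency` is hypothetical.

```python
def efficiency(clock_mhz, ram_kb, rom_kb, area_mm2, power_mw):
    """Normalized performance factor eps = (C * M) / (A * P).

    Assumptions (inferred, not stated in the source):
    C ~ clock frequency in MHz, M = RAM + ROM in kB.
    """
    return (clock_mhz * (ram_kb + rom_kb)) / (area_mm2 * power_mw)


# Rows taken from the table: (clock MHz, RAM kB, ROM kB, area mm^2, power mW)
devices = {
    "Freescale KL03 (M0+)": (48, 2, 40, 4.0, 3),
    "STM32 F103C8 (M3)": (48, 20, 64, 6.0, 100),
    "STM32 L031G6U6 (M0+)": (16, 8, 32, 0.25, 2),
}

for name, row in devices.items():
    print(f"{name}: eps = {efficiency(*row):.1f}")
```

Under these assumptions the three rows evaluate to 168.0, 6.7, and 1280.0, matching the tabulated values.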
The Concepts

1. VM with integrated compiler
2. Programs (and ANN models, too) are always delivered in textual format
3. On-the-fly compilation to linear Bytecode (< 600 lines of C code!)
4. No dynamic memory management except by stack operations
5. KISS (< 3000 lines of C code); highly configurable (custom ISA)
6. VM can be directly embedded in IO loops (microcontrollers) cooperating with other tasks
VM Architecture

- Memory Model
- Instruction Set Architecture (Bytecode)
- Real-time Features and Scheduling
- Compiler
- ML Core Operations
Memory Model and Instruction Processing

Fig. 2. Multi-stack Computer with mixed-mode code segment (no heap memory), integrated JIT compiler, and Bytecode processor (vmloop)
Instruction Set Architecture

- Most ops are zero-operand instructions (single word) operating directly on the stack(s) or the program counter

- With some exceptions the ISA can be freely defined (via code snippets and macro definitions, discussed in the SDK section)

- Zero-operand operations consume one Byte (see next slide)

- Most instructions have constant and equal execution times (real-time; run-time prediction possible)

In practice, however, the widely used and well-known FORTH programming language (or a subset of it; there is no real standard) is commonly used as the ISA.
FORTH

- Reverse Polish Notation (stack language)
- "Write once and forget (read never)" issue
- But keeps compiler simple (low resources and compilation times)

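```forth
( Variable definition, store, and fetch )
var x
10 20 + x !
x @ . cr
( Mean of the first 100 cells of the array data, assumed to be defined elsewhere )
: vecmean
  0
  100 0 do
    data i cell+ @ +
  loop
  100 /
;
vecmean . cr
```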
Bytecode Format

Def. 1. REXA VM Bytecode Format (1 Byte: Post-fix operation, 2 Bytes: Short word, 4 Bytes: Double word, 3 Bytes: Code + Address)
Real-time Scheduling

A sensor node, in particular, has to process a set of tasks characterised by different priorities, arrival times, deadlines, and execution times:

1. Signal sampling and generation (triggering)
2. Event detection
3. Communication (on-chip, on-board, or externally wireless)
4. Computation
5. Energy Management (energy savings and optimisation)
6. Service requests and processing (deadlines)
7. Watchdog (sensor and node failure detector)

- These tasks must be scheduled under time and performance constraints.

- Assuming only one physical control path (one processor), the tasks must be scheduled in slices by one main scheduling loop.

- Self-powered sensor nodes introduce additional energy constraints.

- The tasks can be classified as
  - event-based (short-running),
  - data-driven (long running), and
  - communication (event- and data-driven) tasks.
Execution Loop

The VM Bytecode execution and the source-code compilation can be performed incrementally by limiting the number of instruction steps or compiled tokens per call, which satisfies soft real-time constraints.

- The VM loop is a monolithic switch-case block that maps instruction codes to code snippets executing the operations (optimized to a linear goto LUT with constant-time dispatch)

Fig. 3. Nested execution loops in an embedded system
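
The idea of incremental execution can be illustrated with a toy stack machine. This is only a sketch under stated assumptions: the instruction names, the `vm_run` function, and the dictionary-based VM state are hypothetical and do not reflect the actual REXA VM C implementation, which dispatches via a switch-case block or goto LUT rather than an if/elif chain.

```python
def make_vm(code):
    """Return a tiny stack VM state for a list of (op, arg) tuples."""
    return {"code": code, "pc": 0, "stack": [], "done": False}


def vm_run(vm, max_steps):
    """Execute at most max_steps instructions, then return control.

    Returns the number of instructions actually executed.
    """
    steps = 0
    while steps < max_steps and not vm["done"]:
        op, arg = vm["code"][vm["pc"]]
        vm["pc"] += 1
        if op == "push":
            vm["stack"].append(arg)
        elif op == "add":
            b, a = vm["stack"].pop(), vm["stack"].pop()
            vm["stack"].append(a + b)
        elif op == "halt":
            vm["done"] = True
        steps += 1
    return steps


# Program corresponding to: 10 20 + 5 +
program = [("push", 10), ("push", 20), ("add", None),
           ("push", 5), ("add", None), ("halt", None)]
vm = make_vm(program)

slices = 0
while not vm["done"]:
    vm_run(vm, max_steps=2)   # the VM gets a budget of 2 instructions
    slices += 1               # other node tasks (sampling, I/O) would run here

print(vm["stack"], slices)    # [35] 3
```

Because each call returns after a bounded number of steps, the outer loop of the host application keeps control over the time spent in the VM per iteration.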
System Call-gate Interface

The system call-gate interface is a unified communication and execution interface to the REXA VM run-time environment and compiler. There are two complementary versions of the system call-gate interface addressing software and hardware implementations of the VM:

SM: A shared-memory architecture

MP: A message-based architecture

Fig. 4. System Call-gate Interface connecting a sensor node root application software to an isolated VM instance (a) via shared memory and a single system call function (b) via message-based communication and a serial link or signal bus
Pocket GUW Laboratory

- Example of an application using the call-gate interface: A digital oscilloscope equipped with the REXA VM

Fig. 5. The pocket GUW laboratory uses only low-budget and low-quality devices for GUW-based damage detection in fibre composite materials. The DSO implements REXA VM and communicates via a USB virtual COM port with an external computer. [https://arxiv.org/abs/2302.09002v1](https://arxiv.org/abs/2302.09002v1)
Compiler

**Highlights**

- Just-in-time and incremental compiler
- In-place compilation Program Text ⇒ Bytecode via Code Segment frames
- Low memory requirements
- Use of hash and indexed Lookup Tables (LUT) for core instruction codes and user-defined data and code (function words)
- Only static tables used with constant memory requirement
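
A word lookup with static tables and constant memory can be sketched as a fixed-size hash table with linear probing. The hash function, the table size, and the names `lut_add` and `lut_find` are illustrative assumptions; the source does not describe the concrete table layout used by the REXA VM compiler.

```python
TABLE_SIZE = 16  # fixed at build time; no resizing, no heap growth

# Parallel static tables: word name and instruction code (None = empty slot)
names = [None] * TABLE_SIZE
codes = [0] * TABLE_SIZE


def word_hash(name):
    """Simple multiplicative string hash reduced to a table index."""
    h = 0
    for ch in name:
        h = (h * 31 + ord(ch)) & 0xFFFF
    return h % TABLE_SIZE


def lut_add(name, code):
    """Insert a word using linear probing; returns False if the table is full."""
    idx = word_hash(name)
    for _ in range(TABLE_SIZE):
        if names[idx] is None or names[idx] == name:
            names[idx] = name
            codes[idx] = code
            return True
        idx = (idx + 1) % TABLE_SIZE
    return False


def lut_find(name):
    """Return the instruction code for a word, or -1 if it is unknown."""
    idx = word_hash(name)
    for _ in range(TABLE_SIZE):
        if names[idx] is None:
            return -1
        if names[idx] == name:
            return codes[idx]
        idx = (idx + 1) % TABLE_SIZE
    return -1


for code, word in enumerate(["+", "@", "!", "dup", "loop"]):
    lut_add(word, code)

print(lut_find("dup"), lut_find("swap"))  # 3 -1
```

The memory footprint is determined entirely by `TABLE_SIZE`, which is the property the compiler relies on when no dynamic allocation is available.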
Vector Operations

- Only integer arithmetic is supported (by low-resource and low-power microcontrollers)

An ANN (and CNN) consists of two parts:

1. The data, i.e., parameter, input, and output variables;
2. The structure and functions processing the data.

The ANN can be functionally decomposed into the following vector and matrix operations assuming integer approximation:

$$f : \mathbb{R}^n \rightarrow \mathbb{R}^p \approx \mathbb{Z}^n \rightarrow \mathbb{Z}^p, \quad f = g \circ f_{l-1} \circ f_{l-2} \circ \ldots \circ f_1, \quad f_i(x) = a(\hat{w}_i x + \hat{b}_i)$$

$$g(\vec{z}) = \begin{cases}
\vec{z} & \text{regression} \\
\frac{1}{1 + e^{-z}} & \text{binary classification} \\
\frac{e^{z_j}}{\sum_k e^{z_k}} & \text{multi-class classification}
\end{cases}$$

- ANN models can be decomposed into chained vector operations!
- Vectors are initialised arrays (model parameters) or non-initialised arrays (input, intermediate, and output data)
- Vector (array) data is embedded in the Code Segment (no heap!)

`<array>: [LEN:2][DATA:LEN*WORDSIZE]`

Def. 2. Initialized arrays embedded in-place in code frames and non-initialized arrays stored at the end of the compiled code frame
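
The array layout of Def. 2 can be made concrete with a short encoding sketch. It assumes 16-bit signed words (`WORDSIZE = 2`), little-endian byte order, and that `LEN` counts elements rather than bytes; none of these details are fixed by the definition above, and the helper names are hypothetical.

```python
import struct

WORDSIZE = 2  # bytes per element; assumes 16-bit signed words


def pack_array(values):
    """Encode values as [LEN:2][DATA:LEN*WORDSIZE].

    Assumptions: little-endian byte order, LEN counts elements.
    """
    return struct.pack(f"<H{len(values)}h", len(values), *values)


def unpack_array(frame, offset=0):
    """Decode an array embedded in a code frame starting at offset."""
    (length,) = struct.unpack_from("<H", frame, offset)
    data = struct.unpack_from(f"<{length}h", frame, offset + 2)
    return list(data)


frame = pack_array([329, -499, 10, 400])
print(len(frame), unpack_array(frame))  # 10 [329, -499, 10, 400]
```

Since the length header travels with the data, a vector operation only needs the start address of the array inside the code segment.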

- For ANN and CNN models, a set of scaled vector (array) operations is provided (commonly $W=16$ Bits signed integer).
- Most vector operations use $2W$ arithmetic internally (e.g., 32 Bits) with final down (or up) scaling of the results
- Scaling parameters must be computed by a model analyzer

<table>
<thead>
<tr>
<th>Operation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>array</td>
<td>Create an initialised or uninitialised array (vector)</td>
</tr>
<tr>
<td>vecscale</td>
<td>Scale a vector (negative scale value: division, positive: multiplication)</td>
</tr>
<tr>
<td>vecadd vecmul</td>
<td>Elementwise vector addition and multiplication</td>
</tr>
<tr>
<td>vecfold</td>
<td>Folding operation (ANN FC layer for multiple neurons)</td>
</tr>
<tr>
<td>vecconv</td>
<td>Multi-purpose convolution and pooling operation (CNN)</td>
</tr>
<tr>
<td>vecmap</td>
<td>Elementwise application of a function (e.g., relu or sigmoid), used for ANNs and CNNs</td>
</tr>
<tr>
<td>vecreduce</td>
<td>Vector reduction (scalar output), e.g., minimum or maximum search, sum, product</td>
</tr>
</tbody>
</table>
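
The interplay of $2W$ intermediate values and scaling can be sketched for the elementwise operations. The code below assumes that a scale value acts as a plain factor (positive: multiply, negative: divide by its magnitude, zero: no scaling) and that results are saturated to the 16-bit range; the exact semantics of the REXA VM operations may differ, and the Python functions only borrow their names from the table.

```python
W_MIN, W_MAX = -32768, 32767  # signed range for W = 16 bits


def saturate(x):
    """Clamp a wide (2W) intermediate value back to the W-bit range."""
    return max(W_MIN, min(W_MAX, x))


def apply_scale(x, s):
    """s > 0: multiply by s, s < 0: divide by |s|, s == 0: unchanged."""
    if s > 0:
        return x * s
    if s < 0:
        q = abs(x) // -s          # truncate toward zero, as C does
        return q if x >= 0 else -q
    return x


def vecscale(a, s):
    return [saturate(apply_scale(x, s)) for x in a]


def vecadd(a, b, s=0):
    return [saturate(apply_scale(x + y, s)) for x, y in zip(a, b)]


def vecmul(a, b, s=0):
    # Python integers are unbounded, so x * y stands in for the 2W product
    return [saturate(apply_scale(x * y, s)) for x, y in zip(a, b)]


a = [1000, -2000, 300]
b = [50, 40, -7]
print(vecmul(a, b, -100))  # [500, -800, -21]
print(vecmul(a, b))        # [32767, -32768, -2100]
print(vecadd(a, b))        # [1050, -1960, 293]
```

The second call shows why scaling parameters matter: without down-scaling, two of the three products no longer fit into 16 bits and are clipped.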
## ANN Example

### Model Data

<table>
<thead>
<tr>
<th>Layer</th>
<th>Array</th>
<th>Elements</th>
</tr>
</thead>
<tbody>
<tr>
<td>I</td>
<td>input</td>
<td>14</td>
</tr>
<tr>
<td></td>
<td>wghtsI</td>
<td>{ 329, -499, ..., 10, 400 }</td>
</tr>
<tr>
<td></td>
<td>biasI</td>
<td>{ -764, 389, ..., -907, -405 }</td>
</tr>
<tr>
<td></td>
<td>scaleI</td>
<td>{ -3, 9, ..., 5, 9 }</td>
</tr>
<tr>
<td>H1</td>
<td>actI</td>
<td>14</td>
</tr>
<tr>
<td></td>
<td>wghtsH1</td>
<td>{ 622, -790, ..., 708, 248 }</td>
</tr>
<tr>
<td></td>
<td>scaleH1</td>
<td>{ 0, 5, ..., -4, 7 }</td>
</tr>
<tr>
<td>O</td>
<td>actH1</td>
<td>8</td>
</tr>
<tr>
<td></td>
<td>wghtsO</td>
<td>{ 869, 939, ..., 785, 910 }</td>
</tr>
<tr>
<td></td>
<td>biasO</td>
<td>{ 252, -565 }</td>
</tr>
<tr>
<td></td>
<td>scaleO</td>
<td>{ 4, 5 }</td>
</tr>
<tr>
<td></td>
<td>output</td>
<td>2</td>
</tr>
</tbody>
</table>

### Model Computation

( Input data is stored in input )
( Output data is stored in output )

: forward
  ( Layer I )
  input wghtsI actI scaleI vecmul
  actI biasI actI 0 vecadd
  actI actI $ sigmoid 0 vecmap
  ( Layer H1 )
  actI wghtsH1 actH1 scaleH1 vecfold
  actH1 biasH1 actH1 0 vecadd
  actH1 actH1 $ sigmoid 0 vecmap
  ( Layer O )
  actH1 wghtsO output scaleO vecfold
  output biasO output 0 vecadd
  output output $ sigmoid 0 vecmap

;
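
The chain of fold, add, and map used for each fully connected layer can be mirrored in a few lines of Python for checking results off-target. This is a sketch only: the row-wise weight layout, the fixed-point constant `ONE`, and the piecewise-linear `hard_sigmoid` (an integer stand-in for the sigmoid function) are assumptions, and the numbers are made up rather than taken from the model above.

```python
ONE = 1000  # fixed-point representation of 1.0 (hypothetical choice)


def scale(x, s):
    """s > 0: multiply, s < 0: divide by |s| (truncating), s == 0: keep."""
    if s > 0:
        return x * s
    if s < 0:
        q = abs(x) // -s
        return q if x >= 0 else -q
    return x


def vecfold(inp, weights, scales):
    """Fully connected layer: one scaled dot product per output neuron.

    Assumes weights are stored row by row, one row per output neuron.
    """
    n = len(inp)
    out = []
    for j, s in enumerate(scales):
        row = weights[j * n:(j + 1) * n]
        acc = sum(x * w for x, w in zip(inp, row))  # wide accumulator
        out.append(scale(acc, s))
    return out


def vecadd(a, b):
    return [x + y for x, y in zip(a, b)]


def hard_sigmoid(x):
    """Integer piecewise-linear stand-in for sigmoid, output in [0, ONE]."""
    return max(0, min(ONE, x // 4 + ONE // 2))


def vecmap(a, fn):
    return [fn(x) for x in a]


def layer(inp, weights, bias, scales):
    """Same chain as one layer of forward: vecfold, vecadd, vecmap."""
    return vecmap(vecadd(vecfold(inp, weights, scales), bias), hard_sigmoid)


# Toy layer with 3 inputs and 2 neurons
x = [500, 1000, 0]
w = [1000, 1000, 1000,     # neuron 0
     -2000, 0, 500]        # neuron 1
b = [100, -200]
s = [-ONE, -ONE]           # divide the products by ONE to stay in fixed point

print(layer(x, w, b, s))   # [900, 200]
```

A complete network is then a sequence of such `layer` calls, with the output array of one layer serving as the input array of the next.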
Software Development Kit

Fig. 6. Overview of the overall concept of REXA-VM development (C-SN: C source code snippet, H-SN: C Header snippet, JS: JavaScript, FTH: Forth VM code definitions, JSON: JavaScript Object Notation, CG: Code generator, CC: C Compiler, CP: ConPro HLS, SYN: RTL synthesis tool)
The ISA is defined by a collection of code snippets and macro definitions

- There are different implementations for different host platforms (or OS)
- There are different implementations for software and hardware VMs

- All code and definitions are stored in a simple JSON data-base that can be accessed by various compiler programs

Get it here and try it out: [https://github.com/bsLab/rexavm/](https://github.com/bsLab/rexavm/)
Resilience
Resilience and robustness on VM-level

0. KISS: VM architecture is simple and provides inherent safety due to its simplicity;

1. Enhanced error detection and error recovery due to virtualization and isolation of critical architecture components; a pure textual code and data VM input interface increases the probability of detecting communication errors (data corruption);

2. Strict separation of control and data stacks (r-stack is not accessible by user code);

3. Tasks can only access private data directly (data is embedded in their private code frames);

4. Ensemble VM execution (hardware or multi-core software implementation), executing the same code in parallel on multiple VM instances and comparing intermediate states and results (majority decision making; stopping of faulty computations);

5. Check-pointing with optional persistent storage enabling stop-and-go (instead of stop-and-forget) processing (e.g., on irregular and short power cycles);

6. Exception handling;

7. Adaptivity due to incremental code execution (i.e., code updates overwriting older code via the global dictionary);

8. Hardware-Software-Simulation Co-design by unified DB-driven VM code generators enables the operational simulation, profiling, and test of real network nodes with the same operational semantics and discrete timing;

9. Optional special data codings (hardware, simulated in software) for improved error detection and error correction.
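
The majority decision of the ensemble execution (item 4) can be sketched as follows. The function names and the rule that a strict majority is required are illustrative assumptions; how the VM instances exchange their intermediate states is not covered here.

```python
from collections import Counter


def majority(results):
    """Return (value, agreed) for the most common result among VM instances.

    agreed is False when no value is reported by more than half of them.
    """
    value, count = Counter(results).most_common(1)[0]
    return value, count > len(results) // 2


def faulty_instances(results):
    """Indices of instances whose result deviates from the majority."""
    value, agreed = majority(results)
    if not agreed:
        return list(range(len(results)))
    return [i for i, r in enumerate(results) if r != value]


# Three instances ran the same code; instance 1 delivered a corrupted result
results = [35, 39, 35]
print(majority(results), faulty_instances(results))  # (35, True) [1]
```

Instances reported as faulty could then be stopped or restarted from a checkpoint, in line with items 4 and 5.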
Performance
VM

1. Compilation (MCPS, Tokens)
2. Bytecode execution (MWPS, Bytecode Instructions)

<table>
<thead>
<tr>
<th>Target</th>
<th>Configuration</th>
<th>MIPS</th>
<th>MCPS</th>
<th>Code/Heap</th>
</tr>
</thead>
<tbody>
<tr>
<td>STM32 F103VC3, 72MHz, 256kB ROM, 48kB RAM</td>
<td>CS=1024, DS=256, RS=128, FS=64, Words=101</td>
<td>1.1 / 15k/MHz</td>
<td>0.1 / 1.4k/MHz</td>
<td>8/8 kB</td>
</tr>
<tr>
<td>STM32 F103VC3, 72MHz, 256kB ROM, 48kB RAM</td>
<td>CS=1024, DS=256, RS=128, FS=64, Words=64 (no double word arithmetic)</td>
<td>1.1</td>
<td>0.1</td>
<td>7/7 kB</td>
</tr>
<tr>
<td>STM32 F103VC3, 72MHz, 256kB ROM, 48kB RAM</td>
<td>CS=4096, DS=1024, RS=256, FS=128, Words=101</td>
<td>1.1</td>
<td>0.1</td>
<td>8/16 kB</td>
</tr>
<tr>
<td>STM32 L031, 16 MHz, 32 kB ROM, 8 kB RAM</td>
<td>CS=1024, DS=256, RS=32, FS=32, Words=101</td>
<td>0.24 / 15k/MHz</td>
<td>0.02</td>
<td>7.1/8 kB</td>
</tr>
<tr>
<td>i5-7300U, 3 GHz, 4 GB RAM</td>
<td>CS=16384, DS=4096, RS=1024, FS=256, Words=101</td>
<td>280 / 90k/MHz</td>
<td>27 / 9k/MHz</td>
<td>32/64 kB</td>
</tr>
</tbody>
</table>

**Highlights**

- 1:70 → About 70 native machine instructions / VM instruction execution (ARM Cortex) or 1:15 (Intel x86)
- 1:700 → About 700 native machine instructions / Word compilation (ARM Cortex) or 1:100 (Intel x86)
- Only 13 nJ / VM instruction (ARM Cortex M0+)
- Only 130 nJ / Word compilation (ARM Cortex M0+)
- Computation time of medium-sized ANNs is below 1 second (ARM Cortex M0+, 16 MHz; typically in the millisecond range)
- Compilation time of medium-sized programs is below 1 second (typically in the millisecond range)
- Start-up time of the VM is below 100 ms (typically in the millisecond range)
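
The instruction ratios in the first two highlights follow from the per-MHz throughput values in the performance table. The short calculation below is a sketch that assumes roughly one native instruction per clock cycle, which is a simplification for both platforms; the function name is hypothetical.

```python
def native_per_vm_op(ops_per_mhz):
    """Native clock cycles spent per VM operation.

    ops_per_mhz: throughput normalized to a 1 MHz clock (e.g., 15k/MHz).
    Assumes roughly one native instruction per clock cycle.
    """
    return 1_000_000 / ops_per_mhz


# Normalized throughput values from the performance table
rates = {
    "ARM Cortex, execution": 15_000,
    "ARM Cortex, compilation": 1_400,
    "Intel x86, execution": 90_000,
    "Intel x86, compilation": 9_000,
}

for name, rate in rates.items():
    print(f"{name}: ~{native_per_vm_op(rate):.0f} native cycles per operation")
```

This yields about 67 and 714 cycles on the ARM Cortex and about 11 and 111 cycles on the x86 host, which is of the same order as the rounded ratios quoted above.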
Vector Operations (ANN)

Fig. 7. Normalized computation times for ANNs of different size (with two, three, and four layers) and two different host platforms (Generic i5 x86 @2900 MHz and STM32F103C8 @72MHz) as a function of neurons.

Fig. 8. Code size of ANN as a function of the number of neurons
Summary

A stack-based virtual machine architecture for low-resource, tiny embedded systems was introduced and analyzed. The overhead, even on very low-resource systems, is low with respect to typical running times under energy constraints and tasks to be performed in real-time.

A major feature is the tight coupling of a Text-to-Bytecode compiler with the Bytecode interpreter, ensuring robustness, security, stability, and interoperability in strongly heterogeneous environments.

ML classification and regression models can be computed using integer arithmetic and a set of vector operations with low computation times and memory requirements.