Reliability
The most important job of Artemis is to keep devices updatable. Since updates are delivered through the broker, it is critical that the device can reach the broker.
Strategy
The Artemis service that is installed on the devices periodically downloads its configuration from the broker.
If the device can't reach the broker, it will retry more and more aggressively. First it will reduce the interval between check-ins. At some point, it will turn off non-critical containers during connection attempts. Eventually, it will even disable critical containers for a short period of time.
All of these measures don't help if the broker is not available anymore. In that case, the device attempts to reach a recovery URL. This URL can provide a new broker configuration, allowing the device to connect again.
Being able to contact the broker means that the device can fetch new configurations, and, hopefully, also firmware. However, any new firmware isn't guaranteed to be able to do the same. As such, Artemis only commits to a new firmware after it has successfully connected to the broker with the new firmware.
As a last resort, Artemis also uses a watchdog. If the device can't connect to the broker for a long time, the watchdog will trigger and reboot the device. This also guards against the Artemis service itself getting into a bad state.
Max offline
The frequency at which Artemis contacts the broker can be configured using the
max-offline
setting in the pod specification. A value of 0s
means that the
device stays connected to the broker and polls it continuously (but at most
every 20 seconds). A value of 1h
means that the device can be offline for
up to an hour before it tries to reconnect.
Here is a pod specification with a max-offline
value of 1h
:
# yaml-language-server: $schema=https://toit.io/schemas/artemis/pod-specification/v1.json $schema: https://toit.io/schemas/artemis/pod-specification/v1.json name: my-pod sdk-version: v2.0.0-alpha.163 artemis-version: v0.25.0 max-offline: 1h firmware-envelope: x64-linux containers: {}
The max-offline value can also be set using the artemis
CLI:
This configuration change is not permanent and will be lost with the next firmware update.
Status states
The Artemis service on the device keeps track of its connection status. It is either green, yellow, orange or red. As of September 2024, the following heuristics are used to determine the status:
- green: the device is within the
max-offline
window. - yellow: the device is outside the
max-offline
window, but within a factor of 2. For example, ifmax-offline
is set to 1h, then the device is in yellow state if it hasn't connected for more than 1h, but less than 2h. - orange: the device is outside the
max-offline
window, but within a factor of 3. - red: the device is outside the
max-offline
window, and has been for more than 3 times themax-offline
value.
The following actions (again, as of September 2024) are taken based on the status.
Orange
When a device is in orange state, Artemis reduces the interval between reconnection
attempts by 50%. For example, if max-offline
is set to 1h, then the device will
try to reconnect every 30 minutes.
Artemis also disables all non-critical containers during reconnection attempts.
In this state, Artemis always tries a random entry from the recovery-URLs list to get a new broker configuration.
Red
Devices that can't connect to the broker for a long time are put into red state.
The red state is a more extreme version of the orange state. Artemis will reduce the
reconnect interval by a factor of 4. For example, if max-offline
is
set to 1h, then the device will try to reconnect every 15 minutes.
As for the orange state, Artemis disables all non-critical containers during reconnection attempts. However, now Artemis also disables critical containers 15% of the time.
Contrary to the orange state, Artemis does not always contact the recovery URL. In case the recovery-URL connection makes things worse Artemis only fetches recovery URLs 20% of the time.
Recovery URLs
Even though recovery URLs are baked into pods, they are stored as properties of the fleet. We don't expect recovery URLs to change often, and having them in the fleet configuration avoids duplication.
A recovery URL can be added to an existing fleet with the following command:
If the existing broker for a device doesn't exist anymore, then the device can
use the recovery URL to get a new broker configuration. The new broker configuration
can be generated by running artemis fleet recovery export -o recovery.json
and
serving the resulting recovery.json
file at the recovery URL.
We recommend to use a different domain as recovery URL. This way, the recovery URL is not affected by the same issues as the main broker.
Note that recovery URLs don't need to be online until they are needed.
Watchdogs
As mentioned earlier, Artemis uses watchdogs to monitor its own health. Given a
certain max-offline
value, the watchdog will trigger if the device hasn't connected
to the broker for more than 5 times the max-offline
value (or at least 2 hours, if
max-offline
is small).
This watchdog is a powerful last resort in that it handles many different failures. If Artemis itself gets into a bad state the watchdog will eventually help to recover. Similarly, user programs can communicate with the Artemis service through the Artemis package and a bug in the user program could accidentally disallow Artemis to connect to the broker.