Overlock Architecture

Terms

Before we dive in, here are a few terms used throughout this document:

Diagnostics - The Overlock system incorporates logging, structured logging, 'lifecycle events', metadata and program state. Collectively we call this diagnostic information.

Instrumentation - This is a term we've borrowed from traditional Application Performance Management (APM). In brief, it is the process of adding logging and other tracing to code so that its execution in the real world can be understood.

Overlock Agent - The Overlock agent is a small piece of code which runs on your node and keeps a local cache of log messages and state information in case something goes wrong.

Overlock Client Libraries - These are libraries for Python, C, Java and other languages which provide a means to instrument application code. Internally, these communicate with the Overlock Agent.

Overview

Overlock has been built from the ground up to be a high-throughput logging and exception tracking system for IoT. The backbone of Overlock is, in essence, an IoT platform, except that the data being ingested is debug information rather than sensor values!

An overview block diagram of how Overlock works

Currently, Overlock uses a high-availability MQTT broker as its data connection to devices, which allows for long-lived, low-latency connections to devices at the edge.

Each IoT node which is connected to Overlock is able to store some diagnostic information about itself in an "Overlock Agent". When something goes wrong with the execution of the programs running on that device, or there's a power interruption etc., the Agent reports back to the Overlock platform with a diagnostics report.

When the report reaches Overlock, it is processed and stored in Elasticsearch. One of the most distinctive things about Overlock is that, at this stage, it will automatically gather additional diagnostics reports from other devices in the IoT deployment which are related to the device which observed the problem.
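The gathering step described above can be pictured with a small sketch. This is an illustration only, not Overlock's implementation: the function name, the `is_related` predicate and the `request_report` callback are all hypothetical, and the document does not say how "related" is decided. Here the caller supplies that rule, and the usage example assumes that nodes sharing a product count as related.

```python
def gather_related_reports(failing_node, nodes, is_related, request_report):
    """Collect a diagnostics report from every node related to failing_node.

    All names here are hypothetical; the relation rule is supplied by the caller.
    """
    reports = {}
    for node in nodes:
        if node == failing_node:
            continue  # the failing node has already sent its own report
        if is_related(failing_node, node):
            reports[node] = request_report(node)
    return reports


# Usage: assume (for illustration) that nodes of the same product are related.
products = {"gw-1": "gateway", "gw-2": "gateway", "cam-1": "camera"}
reports = gather_related_reports(
    "gw-1",
    products,
    is_related=lambda a, b: products[a] == products[b],
    request_report=lambda node: {"node": node, "logs": []},
)
```

In this example only "gw-2" is asked for a report, because it is the only other node that shares a product with "gw-1".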

What can be reported to Overlock?

Overlock has been designed to fit the particular needs of logging and debugging in IoT. As such, with Overlock you can report:

  1. Logs - Regular text logs, i.e. console.log() or printf() style logging.
  2. Logging with metadata - Gateway nodes are able to send logs which are tagged as being related to, or on behalf of, another node. This permits use-cases where logging is being done to track data movement between parts of a system.
  3. Structured Logging (coming soon) - In structured logging a full object is sent to Overlock, permitting a far richer amount of data to be sent.
  4. Program State - Overlock allows storage of a state object which can be used to keep track of information like sensor readings, loop variables and other information which can be helpful when something goes wrong.
  5. Metadata - This is information about a node. Often this will include the Product, software versions, user IDs and other data which can help in finding nodes, as well as in unpicking patterns when multiple nodes see the same errors.
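To make the differences between these kinds of report concrete, here is a minimal sketch of how each might look as a plain record. It does not show the Overlock client library API or wire format: the function names and every field name (`type`, `ts`, `on_behalf_of` and so on) are assumptions chosen for illustration.

```python
import json
import time


def log_record(message, on_behalf_of=None):
    """A plain text log; on_behalf_of tags it as relating to another node."""
    record = {"type": "log", "ts": time.time(), "message": message}
    if on_behalf_of is not None:
        record["on_behalf_of"] = on_behalf_of  # hypothetical field name
    return record


def state_record(state):
    """A snapshot of program state, e.g. sensor readings or loop variables."""
    return {"type": "state", "ts": time.time(), "state": dict(state)}


def metadata_record(metadata):
    """Information about the node itself, e.g. software versions."""
    return {"type": "metadata", "ts": time.time(), "metadata": dict(metadata)}


# Usage: a gateway logging on behalf of a sensor it relays data for.
relayed = log_record("reading forwarded", on_behalf_of="sensor-7")
state = state_record({"temperature_c": 21.5, "loop_count": 42})
meta = metadata_record({"firmware": "1.4.2"})
encoded = json.dumps([relayed, state, meta])
```

Keeping each kind of report as a small, typed record is one way to let a receiver treat logs, state and metadata differently while still serialising them the same way.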

Overlock Nodes

Overlock "Nodes" are IoT devices which can connect to the internet and are capable of running the "Overlock Agent". Currently the agent is supported on most flavors of embedded Linux, and we're looking to bring support to RTOSes too.

When each node in an IoT deployment is running the Overlock agent, the programs on those nodes are able to report diagnostics data to their local agent, which then keeps a rolling cache of that information.

At present, the rolling cache is held in RAM and will use a couple of MB of RAM to store log messages and program state. In the future we may add support for storing the cache to a file or other custom means of storing the cache in a non-volatile location.
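As an illustration of what a rolling cache in RAM means, the sketch below keeps the most recent messages within a fixed byte budget and drops the oldest ones first. It is not the agent's actual code: the class name, the 2 MB default and the eviction policy are assumptions made for the example.

```python
from collections import deque


class RollingCache:
    """Bounded in-memory buffer that discards the oldest entries first."""

    def __init__(self, max_bytes=2 * 1024 * 1024):  # assumed budget of 2 MB
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = deque()

    def append(self, message):
        size = len(message.encode("utf-8"))
        self._entries.append((size, message))
        self.used_bytes += size
        # Evict from the oldest end until the cache fits its budget again.
        while self.used_bytes > self.max_bytes and self._entries:
            old_size, _ = self._entries.popleft()
            self.used_bytes -= old_size

    def snapshot(self):
        """Return the retained messages, oldest first."""
        return [message for _, message in self._entries]


# Usage: with a 10-byte budget, the first 4-byte message is pushed out.
cache = RollingCache(max_bytes=10)
for line in ("aaaa", "bbbb", "cccc"):
    cache.append(line)
```

Because the buffer lives only in memory, its contents are lost on power failure, which is why a non-volatile backing store is the natural follow-up.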

The Overlock agent is a simple install with a .deb package or a .snap package, or can be installed from source. The agent needs minimal configuration.

Overlock Platform

The Overlock platform is a hosted solution which ingests data and co-ordinates collection of additional information from nodes which are connected to the system.

The platform has a few types of Entities:

  1. Node - A device or piece of software which can connect to Overlock to report diagnostics information.
  2. Product - All Nodes are associated with a Product, which is a logical grouping.
  3. Event - Events incorporate pretty much any diagnostics information reported from a Node.
  4. Issue - In some cases an Event will indicate an error. When this is the case, an Issue is generated to help debug it.
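The relationships between these entities can be sketched as a small data model. This is a hypothetical model for illustration, not the platform's schema: the field names, the `is_error` flag and the `ingest` function are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Product:
    product_id: str


@dataclass
class Node:
    node_id: str
    product: Product  # every Node belongs to a Product


@dataclass
class Event:
    node: Node
    kind: str  # e.g. "log", "state", "lifecycle" (assumed labels)
    payload: dict
    is_error: bool = False


@dataclass
class Issue:
    trigger: Event
    related_events: List[Event] = field(default_factory=list)


def ingest(event: Event) -> Optional[Issue]:
    """Open an Issue only when the Event indicates an error."""
    return Issue(trigger=event) if event.is_error else None


# Usage: an ordinary log creates no Issue, a crash event does.
node = Node("device-001", Product("thermostat"))
ok = ingest(Event(node, "log", {"message": "boot complete"}))
issue = ingest(Event(node, "log", {"message": "segfault"}, is_error=True))
```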

The Overlock web interface is largely based around searching and viewing Issues. There is also a timeline view for Nodes, which can be useful for understanding the state and execution of code on a particular Node.

The following screenshot shows a stream of events for a particular issue, along with the state at the time of the error:

Overlock Internals

Internally Overlock has been built to ingest millions of data points per hour. It's been built using elements of IoT platforms, as well as some of the best logging tools available.

Each IoT node maintains an MQTT(S) connection to the Overlock platform in addition to any other connections the device has. Because of this, Overlock works with any IoT platform or protocol. Beyond this broad compatibility, we also chose this architecture because it makes the debugging platform completely independent, so it can monitor problems in other parts of the system when you need it most (i.e. when the IoT platform is malfunctioning!).

Devices authenticate with Overlock using a combination of an Overlock-provided project ID and token, along with a user-provided product ID and device ID, which means that developers can continue to use the same IDs as in their current deployment.
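The split between Overlock-provided and user-provided identifiers can be sketched as follows. The four fields come from the description above, but the class, the `client_id` format and the example values are hypothetical: the document does not specify how these values are combined or sent over the connection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeCredentials:
    project_id: str  # provided by Overlock
    token: str       # provided by Overlock
    product_id: str  # chosen by the developer
    device_id: str   # chosen by the developer, e.g. an existing serial number

    def client_id(self) -> str:
        """Hypothetical identifier; the real format is not documented here."""
        return f"{self.project_id}/{self.product_id}/{self.device_id}"


# Usage: the device ID is reused unchanged from an existing deployment.
creds = NodeCredentials(
    project_id="proj-123",
    token="example-token",
    product_id="thermostat-v2",
    device_id="SN-000451",
)
```

Keeping the user-provided IDs as separate fields is what allows an existing device naming scheme to be carried over unchanged.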

Each customer's data is held in separate Elasticsearch indexes. We are also able to offer custom deployments for complete data isolation at the server level.
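One way to picture per-customer indexes is a function that maps a customer identifier to its own index name. The naming scheme below (an "overlock-" prefix plus a slug) is purely an assumption for illustration. The only external fact relied on is that Elasticsearch index names must be lowercase and may not contain spaces or characters such as `/`, `*` or `?`.

```python
import re


def index_for_customer(customer_id, prefix="overlock"):
    """Derive a lowercase, Elasticsearch-safe index name (hypothetical scheme)."""
    # Replace every run of disallowed characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9_-]+", "-", customer_id.lower()).strip("-")
    if not slug:
        raise ValueError("customer_id contains no usable characters")
    return f"{prefix}-{slug}"


# Usage: two customers never share an index name.
acme_index = index_for_customer("Acme Corp")
globex_index = index_for_customer("globex")
```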