Architecture

    This document describes how the Simple IoT project fulfills the basic requirements as described in the top level README.

    IoT Systems are distributed systems

    IoT systems are inherently distributed, with data that needs to be synchronized between a number of different systems, including:

    1. Cloud (one to several instances depending on the level of reliability desired)
    2. Edge nodes (many instances)
    3. User Interface (phone, browser)

    [diagram: IoT distributed system]

    Typically, the cloud instance stores all the system data, and the edge, browser, and mobile devices access a subset of the system data.

    Device communication and messaging

    In an IoT system, data from sensors is continually streaming, so we need some type of messaging system to transfer the data between various instances in the system. This project uses NATS.io for messaging. Some reasons:

    • allows us to push real-time data to an edge device behind a NAT, on a cellular network, etc. -- no public IP address, VPN, etc. required.
    • is more efficient than HTTP as it shares one persistent TCP connection for all messages. The overhead and architecture are similar to MQTT, which is proven to be a good IoT solution. It may also use fewer resources than something like observing resources in CoAP systems, where each observation requires a separate persistent connection.
    • can scale out with multiple servers to provide redundancy or more capacity.
    • is written in Go, so it is possible to embed the server to make deployments simpler for small systems. Also, Go services are easy to manage as there are no runtime dependencies.
    • focuses on simplicity -- its values fit this project.
    • has a good security model.

    For systems that only need to send one value several times a day, CoAP is probably a better solution than NATS. Initially we are focusing on systems that send more data -- perhaps 5-30MB/month. There is no reason we can't support CoAP as well in the future.

    Data modification

    Where possible, modifying data (especially nodes) should be initiated over NATS rather than through direct database calls. This ensures anything in the system can have visibility into data changes. Eventually we may want to hide database write operations so that all writes must be initiated through a NATS message.

    [diagram: data flow]
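
    As a rough illustration of this pattern, the sketch below publishes a point change over NATS instead of writing to the database directly. The subject name, point fields, and JSON encoding here are illustrative assumptions, not the project's actual API.

    ```go
    // Minimal sketch (not the project's actual API): publish a point change
    // over NATS so anything in the system (store, rules, UIs) can observe it.
    // The subject "node.<id>.points" and JSON encoding are illustrative only.
    package main

    import (
        "encoding/json"
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    // Point is a simplified stand-in for the project's point type.
    type Point struct {
        Type  string    `json:"type"`
        Time  time.Time `json:"time"`
        Value float64   `json:"value"`
    }

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        p := Point{Type: "temperature", Time: time.Now(), Value: 21.5}
        data, err := json.Marshal(p)
        if err != nil {
            log.Fatal(err)
        }

        // Interested subscribers (DB store, rule engine, web UI) see the change.
        if err := nc.Publish("node.1234.points", data); err != nil {
            log.Fatal(err)
        }
    }
    ```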

    Simple, Flexible data structures

    As we work on IoT systems, data structures (types) tend to emerge. Common data structures allow us to develop common algorithms and mechanisms to process data. Instead of defining a new data type for each type of sensor, we define one type that works with all sensors. The storage (both static and time-series), synchronization, charting, and rule logic can then stay the same, and adding functionality to the system typically only involves changing the edge application and the frontend UI. Everything between these two end points can stay the same. This is a very powerful and flexible model, as it makes it trivial to support new sensors and applications.

    [diagram: constant vs varying parts of the system]

    The core data structures are currently defined in the data directory for Go code, and the frontend/src/Data directory for Elm code. The fundamental data structures for the system are Nodes, Points, and Edges. A Node can have one or more Points. A Point can represent a sensor value or a configuration parameter for the node. With sensor values and configuration both represented as Points, it becomes easy to use sensor data and configuration in rules or equations, because the mechanism to use both is the same. Additionally, if all Point changes are recorded in a time-series database (for instance InfluxDB), you automatically have a record of all configuration and sensor changes for a node.

    Treating most data as Points has another benefit: we can easily simulate a device -- simply provide a UI or write a program to modify any point, and we can shift from working with real data to simulating scenarios we want to test.
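
    For reference, a simplified sketch of the Node and Point types is shown below; the field names are illustrative and may not exactly match the actual data package.

    ```go
    // Simplified sketch of the core types; field names are illustrative.
    package data

    import "time"

    // Point represents a sensor value or a configuration parameter.
    type Point struct {
        Type  string    // e.g. "temperature", "description"
        Time  time.Time // when the point was last set
        Value float64   // numeric payload (sensor reading, setting)
        Text  string    // string payload (name, description, etc.)
    }

    // Node is a device, user, rule, group, or other entity in the system.
    type Node struct {
        ID     string  // unique ID (UUID or serial number)
        Type   string  // e.g. "device", "user", "rule", "group"
        Hash   []byte  // hash of point timestamps, type, and child hashes
        Points []Point // sensor values and configuration
    }
    ```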

    Edges are used to describe the relationships between nodes as a graph. Nodes can have parents or children and thus be represented in a hierarchy. To add structure to the system, you simply add nested Nodes. The Node hierarchy can represent the physical structure of the system, or it could also contain virtual Nodes. These virtual nodes could contain logic to process data from sensors. Several examples of virtual nodes:

    • a pump Node that converts motor current readings into pump events
    • implement moving averages, scaling, etc. on sensor data
    • combine data from multiple sensors
    • implement custom logic for a particular application
    • a component in an edge device such as a cellular modem

    Edges can also contain metadata (Value, Text fields) that further describe the relationship between nodes. Some examples:

    • role the user plays in the node (viewer, admin, etc)
    • order of notifications when sequencing notifications through a node's users
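
    A simplified Edge sketch (reusing the Point type from the sketch above; again, field names are illustrative, not the actual data package):

    ```go
    // Simplified sketch of an edge; field names are illustrative.
    package data

    // Edge describes a parent/child relationship between two nodes and can
    // carry metadata (for instance a user's role, or notification order).
    type Edge struct {
        ID     string  // unique edge ID
        Up     string  // ID of the parent node
        Down   string  // ID of the child node
        Points []Point // metadata describing the relationship
    }
    ```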

    Being able to arrange nodes in an arbitrary hierarchy also opens up some interesting possibilities, such as creating virtual nodes that have a number of children that are collecting data. The parent virtual nodes could have rules or logic that operate on data from child nodes. In this case, the virtual parent nodes might be a town or city, service provider, etc., and the child nodes are physical edge nodes collecting data, users, etc.

    Node Tree

    The same Simple IoT application can run in both the cloud and device instances. The node tree in a device would then become a subset of the nodes in the cloud instance. Changes can be made to nodes in either the cloud or the device, and data is synchronized in both directions.

    [diagram: cloud and device node trees]

    The following diagram illustrates how nodes might be arranged in a typical system.

    [diagram: example node hierarchy]

    A few notes on this structure of data:

    • A user has access to its child nodes, parent nodes, and parent node descendants (parents, children, siblings, nieces/nephews).
    • Likewise, a rule node processes points from nodes using the same relationships described above.
    • A user can be added to any node. This allows permissions to be granted at any level in the system.
    • A user can be added to multiple nodes.
    • A node admin user can configure nodes under it. This allows a service provider to configure the system for their own customers.
    • If a point changes, it triggers rules of upstream nodes to run (perhaps paced to some reasonable interval).
    • The Edge Dev Offline rule will fire if any of the Edge devices go offline. This allows us to write this rule only once to cover many devices.
    • When a rule triggers a notification, the rule node and any upstream nodes can optionally notify their users.

    The distributed parts of the system include the following instances:

    • Cloud (could be multiple for redundancy). The cloud instances would typically store and synchronize the root node and everything under it.
    • Edge Devices (typically many instances (1000's) connected via low-bandwidth cellular data). Edge instances would store and synchronize the edge node instance and its descendants (e.g. Edge Device 1).
    • Web UI (potentially dozens of instances connected via higher bandwidth browser connection).

    As this is a distributed system where nodes may be created on any number of connected systems, node IDs need to be unique. A unique serial number or UUID is recommended.
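
    For instance, a UUID can be generated whenever a node is created. The sketch below uses the github.com/google/uuid package, which is just one common option; any unique serial number scheme works as well.

    ```go
    // One way to generate a unique node ID, shown for illustration only.
    package main

    import (
        "fmt"

        "github.com/google/uuid"
    )

    func main() {
        id := uuid.New().String() // random (version 4) UUID
        fmt.Println("new node ID:", id)
    }
    ```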

    Synchronization

    NOTE: other than synchronization of node points, which is a fairly easy problem, this section is a work in progress.

    See research for information on techniques that may be applicable to this problem.

    Typically, configuration is modified through a user interface, either in the cloud or with a local UI (e.g. a touchscreen LCD) at an edge device. As mentioned above, the configuration of a Node will be stored as Points. Typically the UI for a node will present fields for the needed configuration based on the Node Type, whether it be a user, rule, group, edge device, etc.

    In the system, Node configuration will be relatively static, but the points in a node may change often as sensor values change, so we need to optimize for efficient synchronization of points. We can't afford the bandwidth to send the entire node data structure any time something changes.

    As IoT systems are fundamentally distributed systems, the question of synchronization needs to be considered. The client (edge), server (cloud), and UI (frontend) can each be considered independent systems, and each can make changes to the same node.

    • An edge device with an LCD/keypad may make configuration changes.
    • Configuration changes may be made in the Web UI.
    • Sensor values will be sent by an edge device.
    • Rules running in the cloud may update nodes with calculated values.

    Although multiple systems may be updating a node at the same time, it is very rare that multiple systems will update the same node point at the same time. The reason for this is that a point typically only has one source. A sensor point will only be updated by an edge device that has the sensor. A configuration parameter will only be updated by a user, and there are relatively few admin users, and so on. Because of this, we can assume there will rarely be collisions in individual point changes, and thus this issue can be ignored.

    Synchronization is managed using the node Hash field and point Time fields. Because there is typically only one distributed instance updating a point value (sensor, user, etc.), we simply consider the point with the latest time stamp to be the current value. Any time a point is requested or changed, it is broadcast via NATS. If the time stamp in the incoming point is newer than the locally stored point, the local copy is updated. If the local copy is newer, the local copy is broadcast, because someone else needs a newer copy. If a complete copy of a node is received, iterate through the points and replace any that are older than the ones in the incoming node.
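
    A minimal sketch of this "newest time stamp wins" rule, using the simplified Point type from the earlier sketch (the real implementation likely also matches on additional point fields):

    ```go
    // mergePoint applies an incoming point to a local point list and reports
    // whether the local copy was newer (and should therefore be re-broadcast).
    // Continues the simplified data package sketch from above.
    package data

    func mergePoint(local []Point, in Point) ([]Point, bool) {
        for i, p := range local {
            if p.Type != in.Type {
                continue
            }
            switch {
            case in.Time.After(p.Time):
                local[i] = in // incoming point is newer; take it
                return local, false
            case p.Time.After(in.Time):
                return local, true // local copy is newer; caller should broadcast it
            default:
                return local, false // same time stamp; nothing to do
            }
        }
        // point type not seen before; add it
        return append(local, in), false
    }
    ```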

    The node Hash field is a hash of:

    • node point timestamps
    • node type
    • and child node Hash fields

    TODO: hashing the node seems to be the same concept as used by Merkle Trees, see research

    Comparing the node Hash field allows us to detect node differences. We then compare the node points to determine the actual differences.
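
    A sketch of how such a hash might be computed, reusing the simplified Node and Point types from above; the project's actual hash function and field set may differ:

    ```go
    // computeHash hashes a node from its type, point time stamps, and the
    // hashes of its children (Merkle-tree style). Illustrative only.
    package data

    import (
        "crypto/sha256"
        "encoding/binary"
    )

    func computeHash(n Node, childHashes [][]byte) []byte {
        h := sha256.New()
        h.Write([]byte(n.Type))
        for _, p := range n.Points {
            binary.Write(h, binary.LittleEndian, p.Time.UnixNano())
        }
        for _, ch := range childHashes {
            h.Write(ch)
        }
        return h.Sum(nil)
    }
    ```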

    Any time a node point is modified, the node's Hash field is updated, and the Hash fields in parents, grandparents, etc. are also computed and updated. This may seem like a lot of overhead, but if the database is local and the graph is reasonably constructed, then each update might require reading a dozen or so nodes and perhaps writing 3-5 nodes. An indexed read in Genji is orders of magnitude faster than a write (at least for Bolt), so this overhead should be minimal. Again, we are optimizing for small/mid-size IoT systems. If a point update requires 50ms, the system can handle 20 points/sec. If the average device sends 0.05 points/sec, then we can handle 400 devices. Switching storage from Bolt to Badger will likely improve this by an order of magnitude, which puts us well into the 1000's of devices. (All this needs to be tested to confirm it is practical.)

    TODO: how to handle node and point deletions.

    TODO: how to handle node type changes.

    There are two things that need to be synchronized:

    1. Node point changes -- this happens when config/sensor data changes.
    2. Node topology changes -- includes adding/deleting nodes.

    There are two synchronization cases:

    1. Catch up -- This is the case where one system starts after another and must catchup to any changes.
    2. Run time -- This is the case where two systems have "caught up" and need to stay synchronized.

    Point synchronization

    Point changes are handled by sending points to a per-node NATS subject any time a point changes. There are three primary instance types:

    1. Cloud: will subscribe to point changes on all nodes (wildcard)
    2. Edge: will subscribe to point changes only for the nodes that exist on the instance -- typically a handful of nodes.
    3. WebUI: will subscribe to point changes for nodes currently being viewed -- again, typically a small number.

    With Point Synchronization, each instance is responsible for updating the node data in its local store.
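
    A sketch of these three subscription patterns, assuming an illustrative subject layout of node.<id>.points (not necessarily the project's actual subject names); error handling is omitted for brevity:

    ```go
    // Sketch of the cloud/edge/web UI subscription patterns over NATS.
    package main

    import (
        "log"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        handle := func(m *nats.Msg) {
            log.Printf("point change on %s: %d bytes", m.Subject, len(m.Data))
        }

        // Cloud: subscribe to point changes on all nodes (wildcard).
        nc.Subscribe("node.*.points", handle)

        // Edge: subscribe only to the handful of nodes stored on this instance.
        nc.Subscribe("node.1234.points", handle)

        // A web UI would similarly subscribe only to nodes currently being viewed.

        select {} // block and process messages
    }
    ```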

    Node topology changes

    Node topology changes happen when:

    1. A node is added or deleted.
    2. An edge is added or deleted.

    TODO: figure out how to synchronize these changes.

    Catch up synchronization

    As described above, every node modification updates the Hash field of the root node of the graph. To synchronize the graph between systems, execute the following steps:

    1. Start at root node.
    2. Does the Hash field match?
    3. If not, push the node ID onto a queue, fetch the node's children, and compare Hash fields. For nodes where the Hash does not match, continue fetching children until you reach a point where all children match.
    4. Once you are at the bottom of the graph, walk back up by popping node IDs off the queue and synchronizing each node's point data, recomputing its hash, etc.

    Note that run-time synchronization continues while catch-up synchronization is running. The catch-up process should be a low-priority background process and may take a number of passes to complete.
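
    A rough sketch of this walk is shown below. The NodeMeta type and the fetch/sync callbacks are hypothetical placeholders for whatever store and NATS calls an instance actually uses; this is not project code.

    ```go
    // Sketch of the catch-up walk: compare hashes top-down, then synchronize
    // differing nodes deepest-first.
    package sync

    import "bytes"

    // NodeMeta carries just the fields needed for the walk.
    type NodeMeta struct {
        ID       string
        Hash     []byte
        Children []string
    }

    func CatchUp(
        rootID string,
        fetchLocal, fetchRemote func(id string) NodeMeta,
        syncNode func(id string), // sync point data and recompute the hash
    ) {
        var queue []string
        toVisit := []string{rootID}

        // Walk down the tree, queueing nodes whose hashes differ.
        for len(toVisit) > 0 {
            id := toVisit[0]
            toVisit = toVisit[1:]

            local, remote := fetchLocal(id), fetchRemote(id)
            if bytes.Equal(local.Hash, remote.Hash) {
                continue // this subtree is already in sync
            }
            queue = append(queue, id)
            toVisit = append(toVisit, local.Children...)
        }

        // Walk back up: synchronize the deepest differing nodes first.
        for i := len(queue) - 1; i >= 0; i-- {
            syncNode(queue[i])
        }
    }
    ```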

    Other synchronization methods often rely on storing the entire history as a set of changes (or even complete versions) and then replaying the changes. In an IoT system where sensors are updating values frequently, such a scheme will not work very well.

    Extensible architecture

    Any siot app can function as a standalone instance, a client, a server, or both. As an example, siot can function as both an edge app (client) and a cloud app (server).

    • full client: a full siot node that initiates and maintains a connection with another siot instance on a server. Can be behind a firewall, NAT, etc. May eventually use NATS leaf node functionality for this.
    • server: needs to be on a network that is accessible by clients

    We also need the concept of a lean client where an effort is made to minimize the application size to facilitate updates over IoT cellular networks where data is expensive.

    Frontend architecture

    Much of the frontend architecture is already defined by the Elm architecture. The current frontend is based on the elm-spa.dev project, which defines the data/page model. Data is fetched using REST APIs, but eventually it may make sense to get real-time data via NATS.

    We'd like to keep the UI optimistic if possible.