About Data Flow

What is Data Flow

While it might not look like it, Data Flow is a mainstream technology, as seen in:

  • Unix Pipes

  • DSP Programming // Max, Pure Data, NI

  • Visual Scripting // Unity, Unreal

  • Graphics Pipelines // Blender

  • Data Analytics // Pig, Apache NiFi

  • IoT // NoFlo, Node-RED

  • DAG Workflow // Luigi, Airflow

It is common in engineering disciplines too:

  • LabVIEW

  • Simulink

  • Ptolemy

  • PLC

Most of these take a push based approach to implementing Data Flow. A complete Data Flow engine in some sense provides Unix pipes + OS + shell + top + ps + linker … Pull based Data Flow (FBP) is not popular yet; I think it is a matter of education, simplicity of implementation and apps. The aim of this book is to help with the education part for both push and pull based approaches, among others. The subsequent chapters will explore the philosophy of Data Flow, walk through implementations and give plenty of practical examples.
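To make the push / pull distinction concrete, here is a minimal sketch (not from any of the systems listed above; all names are illustrative). In push style the producer drives execution by calling downstream; in pull style the consumer drives execution, and Python generators give us the suspend / resume behavior for free.

```python
# Push: the producer drives execution by invoking its downstream node.
def push_producer(downstream):
    for x in range(3):
        downstream(x)

received = []
push_producer(lambda x: received.append(x * 2))
# received == [0, 2, 4]

# Pull: the consumer drives execution; the generator suspends at each
# yield and resumes only when the consumer asks for the next value.
def pull_producer():
    for x in range(3):
        yield x

doubled = [x * 2 for x in pull_producer()]
# doubled == [0, 2, 4]
```

Both produce the same values; the difference is purely in which side holds the control flow, which is the distinction the rest of this chapter builds on.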

Competing Technologies with Data Flow

  • Message Queues / Event Driven Programming / Actor Model

  • CSP

  • State Machines / HSM / Mealy State Machines / REST

  • Linda / Entity Systems

  • Workflows

Currently Data Flow is mostly used for batch systems, but that need not be the case. With some tweaks Data Flow can effectively replace REST / microservices and even the interaction processing code on the front end. Flux and FRP show how signal flow based approaches help in reasoning about complex interactions on the front end.
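The signal flow idea can be sketched in a few lines. This is a hypothetical spreadsheet-style example, not any real FRP library: a `Cell` recomputes its value whenever one of its input cells changes, so interactions propagate through the graph automatically.

```python
# A hypothetical sketch of spreadsheet-style reactivity: changing an
# input cell automatically recomputes every dependent cell.
class Cell:
    def __init__(self, value=None, formula=None, inputs=()):
        self.formula, self.inputs, self.dependents = formula, inputs, []
        for cell in inputs:
            cell.dependents.append(self)
        self.value = value if formula is None else formula(*[c.value for c in inputs])

    def set(self, value):
        self.value = value
        for d in self.dependents:
            d.recompute()

    def recompute(self):
        self.value = self.formula(*[c.value for c in self.inputs])
        for d in self.dependents:
            d.recompute()

a = Cell(1)
b = Cell(2)
total = Cell(formula=lambda x, y: x + y, inputs=(a, b))  # total.value == 3
a.set(10)                                                # total.value == 12
```

This is the same reasoning model Flux and FRP offer: state changes flow along explicit connections rather than through ad hoc callbacks.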

Unix pipes are the “worse is better” approach to flow based programming. Data Flow provides useful abstractions for modeling the backend. It can provide a visual layer for the backend much like what Webflow does for CSS. You don’t have to use visual representations to use Data Flow, though, as the code and visual representations map to each other 1-1. While you can write your backend, UI and frontend with just code, adding a visual layer helps. One reason to favor text over visual is typing speed vs drawing speed.

FBP, in Data Flow terminology, is a pull based system, as opposed to the push based systems that message queues and Unix pipes are, because of the suspend / resume semantics of coroutines. Push based systems are popular because they are easier to implement, and worse is better.
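A pull based pipeline can be sketched with chained generators (an illustrative sketch, assuming Python generators stand in for coroutines): each stage suspends at `yield` and resumes only when the stage downstream of it asks for the next packet, so the final consumer drives the whole graph.

```python
# Each stage is a generator: it suspends at yield and resumes only
# when the downstream stage pulls the next packet.
def source():
    for n in range(5):
        yield n

def square(inlet):
    for n in inlet:
        yield n * n

def keep_even(inlet):
    for n in inlet:
        if n % 2 == 0:
            yield n

# The consumer at the end pulls, which transitively drives every
# upstream stage - the opposite of a Unix pipe, where writes push.
result = list(keep_even(square(source())))
# result == [0, 4, 16]
```

Nothing runs until the final `list(...)` starts pulling, which is exactly the demand driven behavior described above.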

While software systems have used messaging architectures and model driven development, Data Flow is broader than that and can cover all general programming cases. In fact the main goal of writing this book is to enable you to write your own Data Flow engine so that you don’t have to rely on complex tools that weigh a ton. Theoretically Data Flow is rooted in systems theory, and there is a long tradition of using Data Flow techniques in processors - Harvard Architecture, Manchester, instruction scheduling in x86 - and in database engines.

CS Theory relevant to Data Flow

FBP is completely asynchronous which makes it different from KPNs and Petri Nets, which have synchronization points.

With Data Flow, running thousands of tests and simulations becomes trivial, as will be demonstrated in the subsequent sections. In many ways Data Flow / FBP resembles breadboard assembly in electronics. Electronics has a true component oriented architecture and better

  • Testability

  • Longevity

  • Reliability

  • Maintainability

  • Quality
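The testability claim follows from isolation: a node that only reads from its inlet and writes to its outlet can be exercised like a component on a breadboard. A minimal sketch (node name and data are illustrative):

```python
# A node that keeps a running total of the packets it receives.
# Because it touches nothing but its inlet and outlet, testing it
# needs no engine, framework or mocking.
def running_total(inlet):
    total = 0
    for packet in inlet:
        total += packet
        yield total

# Feed the node a plain list in place of a live connection.
assert list(running_total([1, 2, 3])) == [1, 3, 6]
```

Swapping a live connection for a list of test packets is the software analogue of probing one component on the bench.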

Basics of Data Flow

  1. Data Flow is represented by a Graph, which is made of Nodes and Connections. Nodes are also called Actors / Systems / Components / Processes.

  2. Connections are like pipes. They are also called Edges or Wires. They are connected to points called portlets / ports; inlets and outlets signify the direction.

  3. The Data that moves is called Data / Token / Packets / Entities / Records / Messages.

  4. The directionality of movement is implied by the terms Push / Pull or Data Driven / Demand Driven.

  5. The execution mechanism of each node can be one of:

    • Reactive - Node is fired when the data arrives, asynchronously

    • Firing Rules - Node is fired when tokens match some firing rules

    • Classical - Node is fired by its own accord, when it is ready to pull data


Some more useful terms:

  • Coarse grained vs fine grained

  • Homogeneous Data vs Heterogeneous Data

  • Stream / Substream

  • Initial packet

  • Subgraph / Patch

  • Main Component

What is Data Flow useful for?

  • Data Analysis

  • IoT

  • Games

  • Simulation

  • Systems Modeling

  • Visual Applications

  • Reactive Applications like Spreadsheets

  • Modular Applications