Traversal Layer

A traversal is "walking instructions" from a starting node set to a desired destination node set.

const a = require('express'); // (A) the require node
const b = a;

const c = b.use;              // (B) the 'use' member

c();                          // (C) a call of 'use'
b.use();                      // (C) another call

Suppose we want to find every place the 'use' function from express is called. We start from the require node (A) and follow all assignments and references until reaching the 'use' member (B). This gives us our first node set. From there, we again follow all assignments and references until reaching the call (C).

Visualized in batch, the purpose of the traversal layer is to create a mapping from a set of nodes A to a set of nodes B, based on some edge and node relationships between A and B. These traversals can be chained together, as we do here with A→B→C.
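As a rough sketch of this idea (the names and graph below are illustrative, not the engine's API), a traversal is a function from a node set to a node set, and chaining is just composition:

```python
# A traversal maps a set of node ids to a set of node ids.
# Chaining traversals A -> B -> C is function composition.

def chain(*traversals):
    """Compose traversals left to right."""
    def run(nodes):
        for t in traversals:
            nodes = t(nodes)
        return nodes
    return run

# Toy graph: a require node, its 'use' member, and that member's call sites.
members = {"require_1": {"use_1"}}
calls = {"use_1": {"call_1", "call_2"}}

to_use = lambda ns: set().union(*(members.get(n, set()) for n in ns))
to_calls = lambda ns: set().union(*(calls.get(n, set()) for n in ns))

query = chain(to_use, to_calls)
print(sorted(query({"require_1"})))  # -> ['call_1', 'call_2']
```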

Storage

The underlying datastore for SourceScape is ElasticSearch, which powers the raw graph retrieval. We use ElasticSearch for its speed and because, out of the box, it is well suited to document retrieval over a number of dynamic and potentially sparse fields. Retrieval could be implemented on a number of other datastores, including SQL, but ElasticSearch is a sensible first choice.

There are two indexes in ElasticSearch, one for nodes and the other for edges.

Nodes have an id and type, as well as name and index attributes that provide a place to store things like the name of a class or function, or the index of a function argument (first argument, second argument, etc.).

{
	id
	type
	name
	index
}

Edges have a from key, a to key, a type, as well as name and index attributes.

{
	from
	to
	type
	name
	index
}
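To make the schema concrete, here is roughly how a fragment of the express example might be encoded. The specific node types below are illustrative guesses; only the edge types appear in the queries later in this document:

```python
# Hypothetical documents for: const a = require('express'); ... b.use()
nodes = [
    {"id": "n1", "type": "require", "name": "express", "index": None},
    {"id": "n2", "type": "member",  "name": "use",     "index": None},
    {"id": "n3", "type": "call",    "name": None,      "index": None},
]
edges = [
    {"from": "n1", "to": "n2", "type": "member_of", "name": "use", "index": None},
    {"from": "n2", "to": "n3", "type": "call_of",   "name": None,  "index": None},
]
```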

The indexes can be queried on any combination of these fields. Node queries almost always include type, edge queries almost always include type and either from or to.

>> javascript_nodes
{
	term: {
		type: "function"
	}
}
>> javascript_edges
{
	bool: {
		filter: [
			{ term: { type: "function_arg" } },
			{ term: { index: 1 } },
			{ terms: { from: [_id1, _id2, ...] } }
		]
	}
}
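Assuming the standard ElasticSearch query DSL, an edge lookup like this can be assembled as a bool filter (a `term` clause matches a single value, while `terms` matches any id in a list). The helper below is a sketch, not the engine's actual query builder:

```python
def edge_query(edge_type, from_ids, index=None):
    """Build an ElasticSearch bool/filter query against the edge index."""
    filters = [
        {"term": {"type": edge_type}},
        {"terms": {"from": from_ids}},  # match any id in the frontier
    ]
    if index is not None:
        filters.append({"term": {"index": index}})
    return {"bool": {"filter": filters}}

q = edge_query("function_arg", ["_id1", "_id2"], index=1)
```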

Traversal

To define a single traversal from a starting node set, we specify a series of edge types to follow and an edge type to terminate on.

	.traverse[
		follow=edge_types:(
			javascript::declared_as,
			javascript::assigned_as,
			javascript::reference_of
		),
		target=edge_types:(
			javascript::member_of[name="use"]
		)
	]

To form a full traversal query, we chain these traversals together and define a root. The root can either be a node query, as it is here, or a join to another traversal query (from the join layer).

root[type=require, name="express"]
	.traverse[
		follow=edge_types:(
			javascript::declared_as,
			javascript::assigned_as,
			javascript::reference_of
		),
		target=edge_types:(
			javascript::member_of[name="use"]
		)
	]
	.traverse[
		follow=edge_types:(
			javascript::declared_as,
			javascript::assigned_as,
			javascript::reference_of
		),
		target=edge_types:(
			javascript::call_of
		)
	]

The root returns a set of nodes. Each node is a pointer to a location in code.

To perform a traversal, the query engine simply performs a breadth-first search from the initial input node set.

current_frontier = input_set
results = []
while current_frontier:
	next_edges = edge_hop(current_frontier)  # one query against the edge index
	next_frontier = []
	for edge in next_edges:
		if edge.is_terminal:
			results.append(edge)        # matched a target edge type
		else:
			next_frontier.append(edge)  # keep following
	current_frontier = next_frontier

When the traversal is complete, we have a full result set with a number of different paths from the input set to the result set.
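Run against a toy in-memory edge index (a stand-in for the ElasticSearch edge index; the edge types are illustrative), the batch search looks like this:

```python
# Edges: (from, to, type). 'member_of' is the terminal target; the rest are follows.
EDGES = [
    ("a", "b", "assigned_as"),
    ("b", "c", "reference_of"),
    ("b", "use", "member_of"),
    ("c", "use2", "member_of"),
]
FOLLOW = {"assigned_as", "reference_of", "declared_as"}
TARGET = {"member_of"}

def edge_hop(frontier):
    """One query against the edge index: all edges leaving the frontier."""
    return [e for e in EDGES if e[0] in frontier]

def traverse(input_set):
    current_frontier = set(input_set)
    results = []
    while current_frontier:
        next_frontier = set()
        for frm, to, etype in edge_hop(current_frontier):
            if etype in TARGET:
                results.append(to)      # terminal: emit a result
            elif etype in FOLLOW:
                next_frontier.add(to)   # non-terminal: keep walking
        current_frontier = next_frontier
    return results

print(traverse({"a"}))  # -> ['use', 'use2']
```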

Streaming

Performing this traversal in batch is conceptually simple, but it requires holding the entire frontier set in memory at every step. Instead, we would like to save on both memory and work by performing the traversal with streams.

Akka Streams

SourceScape uses pull-based reactive programming using Akka Streams.

Akka Streams has three types of stream processing stages: Sources, Flows, and Sinks. Sources create data, emitting a stream of elements. Flows transform data from one type to another. Sinks consume data and, on completion of the stream, return a “materialized value.”

The Sink we’re using is preset. It simply takes in JSON data and emits it to our frontend as a server-sent event stream (SSE).

To build the traversal, we need to define the correct Source and Flows to feed this Sink. As a side note, a Source[A] combined with a Flow[A, B] is simply a Source[B].

Source

Our root search is straightforward: it is simply a Source tied to an ElasticSearch query. When downstream requests an item, it runs the query.

Computing the initial set means continuously pulling from ElasticSearch. A pull triggers an ElasticSearch query, which returns results via ElasticSearch's scroll API. Say a query returns 100 results: downstream can then pull 100 elements before another ElasticSearch query is triggered.
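This pull-driven behavior can be sketched as a generator that refills one batch at a time; the list-backed query below stands in for ElasticSearch's scroll API:

```python
def scroll_source(run_query, batch_size=100):
    """Yield results one at a time, fetching a new batch only when drained."""
    offset = 0
    while True:
        batch = run_query(offset, batch_size)  # one ElasticSearch round trip
        if not batch:
            return
        yield from batch  # downstream pulls drain this batch before the next query
        offset += len(batch)

# Fake datastore of 250 node ids; three "queries" will be issued in total.
data = [f"n{i}" for i in range(250)]
fake_query = lambda offset, size: data[offset:offset + size]

nodes = list(scroll_source(fake_query))
```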

Flow

With a Source producing a steady flow of node ids, we need a Flow to traverse from these node ids according to our edge traversal algorithm. As in our batch version, we need to perform an edge hop, which is an individual call to our ElasticSearch edge index.

However, in the streamed version we cannot query everything at once. Instead, we chunk the input stream into fixed-size batches and perform an edge hop on each chunk. We split the results of each hop into terminal and non-terminal edges. The non-terminal edges are streamed back into the same edge hop flow, which creates another Source; the terminal edges are mergeSorted with that Source.
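A rough sketch of this chunked flow, with Python generators standing in for Akka Sources. For simplicity this version emits terminal edges before recursing, rather than mergeSorting the two streams as the real implementation does:

```python
from itertools import islice

CHUNK = 2  # fixed chunk size for each edge hop

def traverse_stream(node_ids, edge_hop, is_terminal):
    """Lazily emit terminal edges reachable from node_ids."""
    it = iter(node_ids)
    while True:
        chunk = list(islice(it, CHUNK))
        if not chunk:
            return
        edges = edge_hop(chunk)  # one edge-index query for this chunk
        yield from (e for e in edges if is_terminal(e))
        # non-terminal edges feed the same flow again, creating another source
        non_terminal = [e["to"] for e in edges if not is_terminal(e)]
        yield from traverse_stream(non_terminal, edge_hop, is_terminal)

# Toy edge index: a --follow--> b --target--> use
EDGES = {"a": [{"to": "b", "type": "follow"}],
         "b": [{"to": "use", "type": "target"}]}
hop = lambda chunk: [e for n in chunk for e in EDGES.get(n, [])]
out = list(traverse_stream(["a"], hop, lambda e: e["type"] == "target"))
```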

Additional features

Traversal is the lowest level layer, so there are a lot of needs imposed on it by higher layers.

Sorting (join layer)

As an added layer of complexity, the query engine needs to return everything in id order and to keep it sorted. This is because the next layer (join layer) depends on ordering to perform merge joins.

Unfortunately, we can't just maintain the sort in-stream, because each EdgeHop can scramble the order of the list. An EdgeHop is an ElasticSearch query, and it can only sort by values stored in the ElasticSearch index, not by external values.

Node >>
	n1,
	n2,
	n3

Edge Hop [1] >>
	n1->r1
	n1->r2
	n1->r3
	n2->r2

Edge Hop [2] >>
	[n1] r1->j1
	[n1] r2->j2
	[n2] r2->j4 <<< out of order because how can we sort by n2??
	[n1] r3->j3

So for each edge hop, we need to consume in chunks and sort in memory using the entire graph trace. This is expensive, but it is memory capped and computation capped by the chunk size we define.
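In sketch form, with a simplified trace key standing in for the full graph trace, each consumed chunk is sorted in memory like so:

```python
def sort_chunk(chunk):
    """Sort edge-hop results by their full trace back to the input node."""
    return sorted(chunk, key=lambda hit: hit["trace"])

# Results of Edge Hop [2] from the example above, tagged with their traces.
chunk = [
    {"trace": ("n1", "r1"), "edge": "r1->j1"},
    {"trace": ("n1", "r2"), "edge": "r2->j2"},
    {"trace": ("n2", "r2"), "edge": "r2->j4"},  # out of order in the raw stream
    {"trace": ("n1", "r3"), "edge": "r3->j3"},
]
ordered = sort_chunk(chunk)
```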

Node traversals (optimization layer)

For many traversals, we need to provide information from the SrcLog layer to filter out the final node.

If we have a traversal that starts with a root[type=require], we will get a node traversal when we reverse that:

.node_traverse[
  follow=edge_types:(
      javascript::declared_as.reverse,
      javascript::assigned_as.reverse,
      javascript::reference_of.reverse
  ),
  target=(
      type=require
  )
]

Node traversals are straightforward. They are like edge traverses, except that instead of incorporating a target edge into the edge hop, we run a check on the node itself.
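Sketched with illustrative types, the only change from the edge-traverse loop is the terminal check:

```python
def node_traverse(frontier, edge_hop, node_lookup, follow, target_pred):
    """Like an edge traverse, but terminate by checking the node itself."""
    results = []
    while frontier:
        nxt = set()
        for edge in edge_hop(frontier):
            if edge["type"] not in follow:
                continue
            node = node_lookup(edge["to"])
            if target_pred(node):        # check the node, not a target edge
                results.append(node)
            else:
                nxt.add(node["id"])
        frontier = nxt
    return results

# Toy index: variable x was declared from a require node r.
NODES = {"r": {"id": "r", "type": "require"}, "x": {"id": "x", "type": "var"}}
EDGES = {"x": [{"to": "r", "type": "declared_as.reverse"}]}
hop = lambda f: [e for n in f for e in EDGES.get(n, [])]
found = node_traverse({"x"}, hop, NODES.get, {"declared_as.reverse"},
                      lambda n: n["type"] == "require")
```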

Reversibility (optimization layer)

Because SrcLog allows us to traverse a constraint graph in any direction, we need to be prepared to go through a traversal in reverse.

This requires us to add a few traversal types to our repertoire.

A reverse traversal takes a set of EdgeTraversals and goes in reverse.

If I can go from a require named 'express' to its members named 'use', I should also be able to start from all members named 'use' and reach all requires named 'express' that they are members of.

We will elaborate further on this in the optimization layer.

The reverse operator can take in multiple traversals as well. This tends to be defined at the SrcLog level, because we want to decouple user-facing features from index features.

root[
  type=class
].traverse[
  target=edge_types:(
      javascript::class_method
  )
].traverse[
  target=edge_types:(
      javascript::method_function
  )
].traverse[
  target=edge_types:(
      javascript::function_argument
  )
]
root[
  type=function-arg
].reverse[
  follow=edge_types:(
      javascript::declared_as,
      javascript::assigned_as,
      javascript::reference_of
  ),
  traverses={
    traverse[
      target=edge_types:(
          javascript::method_function
      )
    ].
    traverse[
      target=edge_types:(
          javascript::function_argument
      )
    ]
  }
].reverse[
  follow=edge_types:(
      javascript::declared_as,
      javascript::assigned_as,
      javascript::reference_of
  ),
  traverses={
    traverse[
      target=edge_types:(
          javascript::class_method
      )
    ]
  }
]


How it’s done: follows

A reversal of a series of edge traversals gets remapped to a series of edge traversals using reversed edges.

However, the process is not quite straightforward. When we reverse the traversal, each hop naturally ends after the reversed target but before the reversed follow. We cannot run the reversed follow because we have no idea when to stop (!): our terminal condition going forward was the target edge type.

This usually is not an issue as many traversals do not have follows, but it can be an issue for traversals like member_of.

Suppose we have a series of edge traversals that both have follows and a target. In the forward direction we go from A → B → C → D, but in reverse we actually land on C’ → B’ → A’ and terminate before the original A.

This is a problem we solve in the optimization layer by adding a node traverse after A’ to get us to A. We simply follow and terminate on A’s node type.
