Introduction

Sourcescape is a graph document database for storing source code and querying it with a declarative constraint-based language called SrcLog. Instead of simply searching with keywords, a user can define constraints and get back code that matches those constraints.

Motivation

To understand the motivation for SourceScape, it is important to first understand the code search problem and how it differs from the text search problem.

In text-based search, the user passes in a query, like “quick,” and the search engine looks through a collection of documents to find the ones that contain the keyword. The search engine then returns these documents, while also highlighting where “quick” occurs in the document.

The central problem in text search is mapping query phrases to documents. There are simple queries like “quick,” but also complex Google-style queries like “how do I fold a suit for travel.” The challenge here is ambiguity as there is no perfectly correct way of doing this.

With source code, there is no such ambiguity. A piece of source code is not just a document. All source code can be decomposed into a data structure known as an abstract syntax tree that yields labels for a particular text snippets and denotes relationships between different snippets. Performing static analysis on the code will uncover additional relationships.

With these labels and relations, the user can make more specific statements of what to find, as well as link related snippets together. Instead of mapping queries to documents, the challenge of code search is fundamentally structural. The code search engine must define a traversal through a graph and create a table from that traversal.

Examples

A few motivating examples are helpful here to illustrate concretely what this means.

All express routes and the status codes they could return

express.get('/test', function(req, res) {
	res.send(200)
});

express.get('/test2', function(req, res) {
	if(req.q) {
		res.send(400)
	} else {
		res.send(200) {
	}
});
Route
Status Code

test

200

test2

400

test2

400

All the react-router routes that use a certain component

<Routes>
	<Route path="/availability/ens" element={<AvailabilityComponent availability={ensAvailability}/>}/>
  <Route path="/availability/ethers" element={<AvailabilityComponent availability={ethersAvailability}/>}/>
  <Route path="/availability/web3" element={<AvailabilityComponent availability={web3Availability}/>}/>
  <Route path="/registration" element={<RegistrationComponent 
    availability={ethersAvailability}
    doCommit={doCommit}
    doRegister={doRegister}
  />} />
  <Route path="*" element={<div>Not found</div>}/>
</Routes>
Route
Component

/availability/ens

AvailabilityComponent

/availability/ethers

AvailabilityComponent

/availability/web3

AvailabilityComponent

/registration

RegistrationComponent

All jobs for the api team that run before noon and use the cacheService

new CronJob({
	jobName: "clearCache",
	job: async () => {
		cacheService.clear()
	}
	schedule: { hour: 10, minute: 0 },
	team: 'api'
});

new CronJob({
	jobName: "updateCache",
	job: async () => {
		cacheService.clear()
	}
	schedule: { hour: 11, minute: 0 },
	team: 'api'
});
Job Name
Hour
Minute
Team

clearCache

10

0

api

updateCache

11

0

api

All method calls to Ethereum Contracts in React components

class EthereumNameServiceContract extends Contract {
  constructor(provider: ethers.providers.Provider) {
    super(
      '0x57f1887a8BF19b14fC0dF6Fd9B2acc9Af147eA85',
      [
				'function ownerOf(uint256 tokenId) public view returns (address)',
				'function available(uint256 id) view returns(bool)',
				'function register(uint256 id, address owner, uint256 duration) external returns (uint256)',
			],
      provider,
    )
  }
}

...

const ethersENS = new EthereumNameServiceContract(provider);

function SomeComponent () {
	ethersENS.available ...
	ethersENS.register ...
}
Contract
Address
Method
Component

EthereumNameServiceContract

0x57f1887a8BF19b14fC0dF6Fd9B2acc9Af147eA85

available

AppInner

EthereumNameServiceContract

0x57f1887a8BF19b14fC0dF6Fd9B2acc9Af147eA85

available

AppInner

EthereumNameServiceContract

0x57f1887a8BF19b14fC0dF6Fd9B2acc9Af147eA85

ownerOf

AvailabilityComponent

EthereumRegistrarContract

0x283Af0B28c62C092C9727F1Ee09c02CA627EB7F5

registerWithConfig

RegistrationComponent

EthereumRegistrarContract

0x283Af0B28c62C092C9727F1Ee09c02CA627EB7F5

rentPrice

RegistrationComponent

SrcLog

In order to tackle these problems, Sourcescape requires a query language that can express structure. Specifically, it should be able to assert facts about a particular object (there’s a function A named “hello”) as well as relationships between these objects (B is the first argument of A).

To support this, Sourcescape uses a stripped down variant of DataLog, a logic programming language that allows one to make assertions on a dataset, called SrcLog. This variant will be limited to two types of assertions:

fact(A).
relationship(A, B).

In addition to the usual DataLog syntax, SrcLog adds the ability to specify attributes for both facts and relationships.

fact(A)[attr="something"].
relationship(A, B)[attr="something"].

For our example, let's start with our express routes problem from above.

express.get('/test', function(req, res) {
	res.send(200)
});

express.get('/test2', function(req, res) {
	if(req.q) {
		res.send(400)
	} else {
		res.send(200) {
	}
});

From this code, we want to pull the route name, as well as all the different status codes we’re sending back.

Route
Return

test

200

test2

400

test2

400

Let’s first break the problem up into it’s constituent pieces

  1. Route: Extracting the route from express.get(____)

  2. Status Code: Extracting the status code from res.send(____)

  3. Relationship: Establishing a relationship between the two.

Route

const express = require('express')

express.get('/test', ...)

In plain English, what we want to do is look for all places we call the get function from the express library and pluck out its first argument.

javascript::require(A)[name="express"].
javascript::member(A, B)[name="get"].
javascript::call(B, C).
javascript::call_arg(C, D)[index=1].

Unfortunately, while the SrcLog representation for this is quite simple, the realities of traversing the code graph are much more complicated.

The SrcLog above should be able to match on so many different variants that have different syntax but the same semantics:

//
const express = require('express');
express.get('/test', ...)

//
import { get } from 'express';
get('/test', ...)

//
const express = require('express');
const { get } = express;
get('/test', ...);

Within each of the relationship assertions, we need to be able to deal with this ambiguity.

Status Code

Extracting the status code also forces us to break down the structure of our search.

function(req, res) {
	res.send(400)
}

We want to find all functions where we call the send method of the second argument of that function. Then we want to pluck out the first argument of that call.

javascript::function(E).
javascript::function_arg(E, F)[index=2].
javascript::member(F, G)[name="send"].
javascript::call(G, H).
javascript::call_arg(H, I)[index=1].

Relationship

With both these pieces, we can stitch them together into a complete query that captures all the constraints we want.

The second argument of the call in the first query (C) has an argument at the second that is the function (E) in the second query.

Incrementally, we only add one constraint:

javascript::call_arg(C, E)[index=2].
%-- First query (extract route)
javascript::require(A)[name="express"].
javascript::member(A, B)[name="get"].
javascript::call(B, C).
javascript::call_arg(C, D)[index=1].

%-- Second query (extract status)
javascript::function(E).
javascript::function_arg(E, F)[index=2].
javascript::member(F, G)[name="send"].
javascript::call(G, H).
javascript::call_arg(H, I)[index=1].

%-- Bridge linking the queries
javascript::call_arg(C, E)[index=2].
Route (D)
Status (I)

test

200

test2

400

test2

400

SrcLog is a query language for a graph dataset. It allows us to declaratively and interactively specify a series of labelled constraints for subgraphs that we want to see. It will then return these subgraphs with nodes labelled as specified by the constraints.

Query execution

SrcLog is only the query language. Much like SQL, it hides all the complexities of query execution from the user and only presents a simplified declarative model for the user to express constraints. The query execution engine is what takes SrcLog and compiles it down into a program that can access raw data (from ElasticSearch in our case.)

This execution goes through three layers:

  1. Optimization

  2. Join

  3. Traversal

Optimization layer

SrcLog expresses a constraint graph, but it doesn’t make any statement about how to traverse that graph.

Let’s just take a part of our SrcLog above for simplicity.

javascript::require(A)[name="express"].
javascript::member(A, B)[name="get"].
javascript::call(B, C).

There are 3 ways of traversing this graph. The query engine could start at A by getting all require nodes named “express,” then get all members of that, and then getting all calls of that. Or it could start at B by getting all member nodes named “get.” Or it could start at C by getting all call nodes.

These 3 paths would return the same result but would have wildly different performance characteristics. If the query engine started by getting all call nodes at C, it would need to start searching from a huge number of nodes. It would waste a lot of effort paring down this initial node set with edge traversals. In contrast, if it started at A, it would start from a much smaller and more specific set of nodes.

Finding the optimal path out of all options is called the minimum spanning tree problem.

Join layer

Once the query engine has the minimum spanning tree, it needs to actually execute the graph search.

%-- First query (extract route)
javascript::require(A)[name="express"].
javascript::member(A, B)[name="get"].
javascript::call(B, C).
javascript::call_arg(C, D)[index=1].

%-- Second query (extract status)
javascript::function(E).
javascript::function_arg(E, F)[index=2].
javascript::member(F, G)[name="send"].
javascript::call(G, H).
javascript::call_arg(H, I)[index=1].

%-- Bridge linking the queries
javascript::call_arg(C, E)[index=2].

SrcLog is used interactively so one important requirement of query execution is to be able to stream results. This is necessary for several reasons:

  1. I want to be able to limit results. There are a lot of code queries that can return huge amounts of results. Very simply, if I search for every function or every call, I could get thousands, maybe close to a million results in a sufficiently large codebase.

  2. When I limit results, I want to do the minimum work required to get the results I want. Namely, I don’t want to do the same work for getting 10 results as I do for getting 1000 results.

  3. Related to 2, I do not want to move entire datasets of nodes in memory. I want to stream results into the next edge traversal.

A stream has a beginning and an ending. The join layer takes the minimum spanning tree and turns it into a stream. It starts from some initial node query and then streams results to each subsequent traversal. It also manages the forwarding of results to the next traversal. When a query branches like ours above, we need to join the leaves of the spanning tree to get this single stream to pull.

Traversal layer

Once we manage the branching, we need to deal with the complexity of simply going from Point A to Point B.

Expressions that are the same semantically can be wildly different syntactically. It is often necessary to follow references and other edge types to get to a particular edge type that the user is interested in.

//
const express = require('express');
express.get('/test', ...)

//
import { get } from 'express';
get('/test', ...)

//
const express = require('express');
const { get } = express;
get('/test', ...);

The traversal layer needs to be able to repeatedly follow some group of edge types and then terminate on a particular edge type.

Next

In the following sections we will discuss the above in reverse order:

Traversal Layer

Join Layer

Optimization Layer

Last updated