Writing an indexer
This page describes the SCIP Code Intelligence Protocol and how you can write an indexer to emit SCIP.
At a high level, you need to follow these steps:
- Familiarize yourself with the SCIP protobuf schema.
- Import or generate SCIP bindings.
- Generate a minimal index with occurrence information.
- Test your indexer using the scip CLI's snapshot subcommand.
- Progressively add support for more features with tests.
If you run into problems or have questions about any of these steps, please open an issue on the SCIP issue tracker.
Let's go over each step one-by-one.
Understanding the SCIP protobuf schema
The SCIP protobuf schema describes the structure of a SCIP index in a machine-readable format.
The main structure is an Index, which consists of a list of documents along with some metadata. Optionally, an index can also provide hover documentation for external symbols that will not be indexed.
A Document has a unique path relative to the project root. It also has a list of occurrences, which attach information to source ranges, as well as a list of symbols that are defined in the document.
The information covered by an Occurrence can be syntactic or semantic:
- Syntactic information, such as the syntax_kind field, is used for highlighting.
- Semantic information, such as the symbol and symbol_roles fields, is used to power code navigation features like Go to definition and Find references.
Occurrences also allow attaching diagnostic information, which can be used by static analysis tools.
For more details, see the doc comments in the SCIP protobuf schema.
You may also find it helpful to see how existing indexers emit information. For example, you can take a look at the scip-typescript or scip-java code to see how they emit SCIP indexes.
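To make the shape of these messages concrete, here is a minimal sketch in Python. The classes below are hypothetical stand-ins written for illustration, not the generated SCIP bindings: they keep only a few fields, represent a document's symbols as plain strings, and assume that the Definition role is the lowest bit of the symbol_roles bitset. Check the schema's doc comments for the authoritative field list.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins for illustration; NOT the generated SCIP bindings.
DEFINITION_ROLE = 0x1  # assumed bit value for the Definition role


@dataclass
class Occurrence:
    # Assumed layout: [line, start_char, end_char], 0-based, single-line range.
    range: List[int]
    symbol: str = ""       # semantic: which symbol this range refers to
    symbol_roles: int = 0  # semantic: bitset of roles, e.g. DEFINITION_ROLE
    syntax_kind: str = ""  # syntactic: used for highlighting


@dataclass
class Document:
    relative_path: str  # unique path relative to the project root
    occurrences: List[Occurrence] = field(default_factory=list)
    symbols: List[str] = field(default_factory=list)  # simplified to strings


@dataclass
class Index:
    documents: List[Document] = field(default_factory=list)
    # Metadata and external symbols are omitted from this sketch.


def definitions(doc: Document) -> List[str]:
    """Return the symbols of all occurrences marked as definitions."""
    return [o.symbol for o in doc.occurrences if o.symbol_roles & DEFINITION_ROLE]


# The symbol strings here are placeholders, not valid SCIP symbol syntax.
doc = Document(
    relative_path="src/main.py",
    occurrences=[
        Occurrence(range=[0, 4, 8], symbol="example-main", symbol_roles=DEFINITION_ROLE),
        Occurrence(range=[3, 0, 4], symbol="example-main"),
    ],
)
index = Index(documents=[doc])
```

An occurrence without the definition bit, like the second one above, is treated as a plain reference in this sketch.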
Importing or generating SCIP bindings
The SCIP repository contains bindings for several languages.
Depending on your indexer's implementation language, you can import the bindings directly using your language's package manager, or by using git submodules. One benefit of this approach is that you do not need to have a protobuf toolchain to generate code from the schema. This also makes it easier to bump the version of SCIP to pick up newer changes to the schema.
Alternatively, you can vendor the SCIP protobuf schema into your repository and set up Protobuf generation yourself. This lets you control the process end to end, at the cost of making updates a bit more cumbersome.
Newer Sourcegraph versions will maintain backwards compatibility with older SCIP versions, so there is no risk of not being able to upload SCIP indexes if a vendored schema has not been updated in a while.
Generating minimal index with occurrence information
As a first pass, we recommend generating occurrences for a subset of declarations and checking that the generation works end to end.
In the context of an indexer, this typically involves using a compiler frontend or a language server as a library. First, run the compiler pipeline until semantic analysis is completed. Next, perform a top-down traversal of ASTs for all files, recording information about different kinds of occurrences.
At the end, write a conversion pass from the intermediate data to SCIP using the SCIP bindings.
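The traverse-then-convert flow described above can be sketched as follows. This example uses Python's standard ast module as a stand-in for a compiler frontend and only handles function definitions. The record layout, the function names, and the symbol string are all assumptions made for illustration; in particular, the symbol string is a placeholder and not valid SCIP symbol syntax.

```python
import ast
import re
from typing import Dict, List

DEFINITION_ROLE = 0x1  # assumed bit value for the Definition role

# Matches the keyword(s) that precede a function name.
_DEF_KEYWORD = re.compile(r"(async\s+)?def\s+")


def collect_definitions(source: str) -> List[Dict]:
    """Traverse the AST and record one intermediate record per function definition."""
    lines = source.splitlines()
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            line = node.lineno - 1  # ast lines are 1-based; ranges here are 0-based
            # col_offset points at the keyword, so skip past it to reach the name.
            # Note: col_offset is a UTF-8 byte offset; this sketch assumes ASCII.
            match = _DEF_KEYWORD.match(lines[line], node.col_offset)
            if match is None:
                continue
            start = match.end()
            records.append(
                {"name": node.name, "line": line, "start": start, "end": start + len(node.name)}
            )
    return records


def to_occurrences(records: List[Dict], path: str) -> List[Dict]:
    """Conversion pass: turn intermediate records into occurrence-shaped dicts."""
    return [
        {
            "range": [r["line"], r["start"], r["end"]],
            "symbol": f"placeholder:{path}:{r['name']}",  # not real SCIP symbol syntax
            "symbol_roles": DEFINITION_ROLE,
        }
        for r in records
    ]


occurrences = to_occurrences(collect_definitions("def foo():\n    pass\n"), "example.py")
```

Keeping the intermediate records separate from the final conversion means the traversal does not need to know about the SCIP bindings, so only the last function would change when switching to the real generated classes.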
As a convention, indexers should use index.scip as the default filename for the output. The Sourcegraph CLI recognizes this filename and uses it as the default upload path.
You can inspect the Protobuf output using protoc:
# assuming scip.proto and index.scip are in the current directory
protoc --decode=scip.Index scip.proto < index.scip
For robust testing, we recommend making sure that the result of indexing is deterministic. One potential source of issues here is non-deterministic iteration over the key-value pairs of a hash table. If re-running your indexer changes the order in which occurrences are emitted, snapshot testing may report different results.
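One way to guard against ordering differences is to sort occurrences explicitly before emitting them. The sketch below assumes occurrences are collected in a table keyed by symbol and that ranges are integer lists with either three elements (single line) or four elements (multi-line); the function names and the dict layout are hypothetical.

```python
from typing import Dict, List, Tuple


def range_key(rng: List[int]) -> Tuple[int, int, int, int]:
    """Expand a range to (start_line, start_char, end_line, end_char) for sorting."""
    if len(rng) == 3:
        line, start, end = rng
        return (line, start, line, end)
    start_line, start_char, end_line, end_char = rng
    return (start_line, start_char, end_line, end_char)


def stable_occurrences(by_symbol: Dict[str, List[List[int]]]) -> List[Dict]:
    """Flatten a symbol -> ranges table into occurrences in a reproducible order."""
    occurrences = [
        {"range": rng, "symbol": symbol}
        for symbol, ranges in by_symbol.items()
        for rng in ranges
    ]
    # Sort by position, then by symbol, so the output never depends on the
    # order in which the table happened to be populated or iterated.
    return sorted(occurrences, key=lambda o: (range_key(o["range"]), o["symbol"]))
```

With this in place, two runs that fill the table in different orders still emit identical occurrence lists.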
Snapshot testing with scip CLI
One of the key design criteria for SCIP was that it should be easy to understand an index file and test an indexer for correctness.
The scip CLI has a snapshot subcommand which can be used for golden testing. The snapshot command inspects an index file and regenerates the source code, attaching comments describing occurrence information.
Here is a slightly cleaned-up snippet from running scip snapshot on the index generated by running scip-typescript over itself:
  function scriptElementKind(
//         ^^^^^^^^^^^^^^^^^ definition scip-typescript npm @sourcegraph/scip-typescript 0.2.0 src/FileIndexer.ts/scriptElementKind().
    node: ts.Node,
//  ^^^^ definition scip-typescript npm @sourcegraph/scip-typescript 0.2.0 src/FileIndexer.ts/scriptElementKind().(node)
//        ^^ reference local 1
//           ^^^^ reference scip-typescript npm typescript 4.6.2 lib/typescript.d.ts/ts/Node#
    sym: ts.Symbol
//  ^^^ definition scip-typescript npm @sourcegraph/scip-typescript 0.2.0 src/FileIndexer.ts/scriptElementKind().(sym)
//  documentation ```ts
//       ^^ reference local 1
//          ^^^^^^ reference scip-typescript npm typescript 4.6.2 lib/typescript.d.ts/ts/Symbol#
  ): ts.ScriptElementKind {
//     ^^ reference local 1
//        ^^^^^^^^^^^^^^^^^ reference scip-typescript npm typescript 4.6.2 lib/typescript.d.ts/ts/ScriptElementKind#
The carets and contextual information make it easy to visually check that:
- Occurrences are being emitted for the right source ranges.
- Occurrences have the expected symbol strings. The exact syntax for the symbol strings is described in the doc comment for Symbol in the SCIP Protobuf schema.
- Symbols correspond to the right package. For example, ScriptElementKind is defined in the typescript package (the compiler), whereas scriptElementKind is defined in @sourcegraph/scip-typescript.
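To illustrate the idea behind this output format, here is a toy renderer that interleaves source lines with caret comments. It is a sketch written for this page, not the scip CLI's implementation: the function name, the dict keys, and the choice to shift source lines right by the width of the comment marker are assumptions, and it only handles single-line ranges.

```python
from typing import Dict, List


def render_snapshot(source: str, occurrences: List[Dict], comment: str = "//") -> str:
    """Interleave source lines with caret comments describing each occurrence."""
    by_line: Dict[int, List[Dict]] = {}
    for occ in occurrences:
        by_line.setdefault(occ["range"][0], []).append(occ)

    out = []
    pad = " " * len(comment)
    for number, text in enumerate(source.splitlines()):
        # Shift the source right so carets stay aligned after the comment marker.
        out.append(pad + text)
        for occ in sorted(by_line.get(number, []), key=lambda o: o["range"][1]):
            _, start, end = occ["range"]  # single-line ranges only in this sketch
            carets = " " * start + "^" * (end - start)
            out.append(f"{comment}{carets} {occ['role']} {occ['symbol']}")
    return "\n".join(out)


snapshot = render_snapshot(
    "def foo():\n    pass",
    [{"range": [0, 4, 7], "role": "definition", "symbol": "placeholder-foo"}],
    comment="#",
)
```

Because the rendered text is plain and stable, it can be committed next to the test inputs and compared against fresh output on every run.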
Progressively adding support for language features
We recommend adding support for different features in the following order:
- Emit occurrences and symbols for a single file.
- Iterate over different kinds of entities (functions, classes, properties, etc.).
- Emit hover documentation for entities. If the markup is in a format other than CommonMark, we recommend addressing that difference after addressing other features.
- Add support for implementation relationships, enabling Find implementations.
- (Optional) If the hover documentation uses markup in a format other than CommonMark, implement a conversion from the custom markup language to CommonMark.
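For the implementation relationships step, the sketch below shows one possible way to derive relationship records from a table of which symbols implement which others. The field names relationships and is_implementation reflect our reading of the schema and should be checked against its doc comments; the function name, the input layout, and the symbol strings are hypothetical placeholders.

```python
from typing import Dict, List


def symbol_information(implements: Dict[str, List[str]]) -> List[Dict]:
    """Build symbol records whose relationships mark the symbols they implement."""
    return [
        {
            "symbol": symbol,
            "relationships": [
                {"symbol": parent, "is_implementation": True}
                for parent in sorted(parents)  # sorted for deterministic output
            ],
        }
        for symbol, parents in sorted(implements.items())
    ]


# Placeholder symbol names, not valid SCIP symbol syntax.
infos = symbol_information({"FileReader": ["Reader", "Closeable"]})
```

Sorting both the symbols and their parents keeps the emitted records stable across runs, which matters for snapshot testing.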