Semnet: Semantic networks from embeddings

I’m happy to introduce Semnet, a small Python library which efficiently constructs graph structures from embeddings.

The name “Semnet” derives from semantic network, as it was initially designed for an NLP use case, but it will work well with any form of embedded data (e.g., images, audio, or even graphs).

In this post, I’ll quickly run over some of the features and use cases, discuss why Semnet exists, and look at the benefits of graph structures in natural language processing and beyond.

If you want to know more, I have another hands-on post providing examples of Semnet in action. You can also read the docs and check out the repository on GitHub.

If you’d just like to see something cool you can do with Semnet, click on the image below for an interactive visualisation of a network of quotes.

Graph built with Semnet and visualised in Cosmograph. Data: m-ric via Hugging Face.

Features

Semnet is a relatively small library that does just one thing: efficiently construct network structures from embeddings.

Its key features are:

- Scalable graph construction via approximate nearest neighbour search (Annoy), avoiding quadratic pairwise comparison
- Configurable similarity threshold, distance metric and top_k neighbours per node
- Output as standard NetworkX graphs, with optional node labels
- Export to pandas DataFrames via to_pandas

Use cases

Semnet may be used for:

- Exploring the structure of a corpus: clusters, hubs and outliers
- NLP, RAG, search and context-engineering workflows
- Deriving relationship features for downstream models
- Interactive visualisation of document collections, as in the Cosmograph example above

Quick start

You can install Semnet with pip:

pip install semnet

All you need to start building your network is a set of embeddings and (optionally) some labels.

from semnet import SemanticNetwork, to_pandas
from sentence_transformers import SentenceTransformer

# Your documents
docs = [
    "The cat sat on the mat",
    "A cat was sitting on a mat",
    "The dog ran in the park",
    "I love Python",
    "Python is a great programming language",
]

# Generate embeddings (use any embedding provider)
embedding_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
embeddings = embedding_model.encode(docs)

# Create and configure semantic network
sem = SemanticNetwork(thresh=0.3, distance="angular")

# Build the semantic graph from your embeddings
G = sem.fit_transform(embeddings, labels=docs)

# Analyze the graph
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")

# Export to pandas
nodes_df, edges_df = to_pandas(G)
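
The returned object is a standard NetworkX graph, so you can also inspect the edges Semnet found directly. A minimal sketch, assuming each edge stores its similarity information as edge attributes (the exact attribute names may differ between Semnet versions):

# Inspect the edges and their attributes
for u, v, attrs in G.edges(data=True):
    print(f"{u} <-> {v}: {attrs}")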

What problem does Semnet solve?

Graph construction entails finding pairwise relationships (edges) between entities (nodes) in a dataset.

For large corpora, scaling problems rapidly become apparent as the number of possible pairs in a set scales quadratically.

$$\text{pairs} = \frac{n(n-1)}{2}$$
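
To get a feel for the numbers, here is a quick back-of-the-envelope check:

# Unique pairs grow quadratically with corpus size
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} documents -> {n * (n - 1) // 2:,} pairs")

#   1,000 documents -> 499,500 pairs
#  10,000 documents -> 49,995,000 pairs
# 100,000 documents -> 4,999,950,000 pairs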

Naive approach

If we naively attempt to compare every pair of documents, we run into problems even with modestly sized corpora. For example, building a graph from 10,000 documents entails operations across roughly 50 million pairs; for 100,000 documents it’s around 5 billion!

Iterating over each pair in Python is, of course, very slow. Faster vectorised approaches exist via scikit-learn, but these run into a bigger problem: they are memory intensive.

from sklearn.metrics import DistanceMetric
import numpy as np

dist = DistanceMetric.get_metric("euclidean")

# Generate 100,000 random embeddings
embeddings = np.random.rand(100_000, 768)
dist_scores = dist.pairwise(embeddings)

>> MemoryError: Unable to allocate 74.5 GiB for an array
with shape (100000, 100000) and data type float64

With Semnet

Semnet solves the scaling problem using Approximate Nearest Neighbours search with Annoy.

Instead of making comparisons between every pair of documents in the corpus, Semnet indexes the embeddings, iterates over each one, and returns the top_k best matches from within its neighbourhood.
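
If you’re curious what that looks like under the hood, here is a minimal sketch of the ANN pattern using Annoy directly (not Semnet’s actual internals; the tree count and top_k values are illustrative):

from annoy import AnnoyIndex
import numpy as np

embeddings = np.random.rand(10_000, 768)

# Index every embedding once
index = AnnoyIndex(768, "angular")
for i, vec in enumerate(embeddings):
    index.add_item(i, vec)
index.build(10)  # 10 random-projection trees

# Query each item's neighbourhood instead of scanning all pairs
top_k = 5
for i in range(len(embeddings)):
    # Request top_k + 1 because an item's nearest neighbour is itself
    neighbours, distances = index.get_nns_by_item(
        i, top_k + 1, include_distances=True
    )
    # Keep (i, j, distance) pairs below a threshold as candidate edges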

Let’s try again on the same rig, this time with Semnet.

import time

import numpy as np
from semnet import SemanticNetwork

# Start the timer
start_time = time.time()

# Generate 100,000 random embeddings
embeddings = np.random.rand(100_000, 768)

# Build the semantic network
sem = SemanticNetwork(thresh=0.4, top_k=5)
G = sem.fit_transform(embeddings)

# Stop the timer and report
end_time = time.time()
print(f"Processing time: {end_time - start_time:.2f} seconds")

>> Processing time: 24.26 seconds

Not only can we process all 100,000 embeddings without crashing the kernel, the whole graph is built in under 30 seconds.

Why use graph structures?

By opening up the NetworkX API to embedded documents, Semnet provides a new suite of tools and metrics for workflows in domains such as NLP, RAG, search and context engineering.

For most use cases, Semnet will work best as a complement to traditional spatial workflows, rather than as a replacement. Its power lies in encoding information about relationships between data points, which can be used as features in downstream tasks.

Approaches will vary depending on your use case, but benefits include:

- Community detection to surface clusters of semantically related documents
- Centrality and degree measures to identify representative, hub, or outlier documents
- Connected components and neighbourhood queries for grouping near-duplicates
- Graph-derived features (degree, centrality, community membership) for downstream models, as sketched below
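
As a concrete sketch, here is how you might pull graph-derived features out of a Semnet graph with standard NetworkX calls (the feature names are illustrative, and Louvain is just one reasonable choice of community algorithm):

import networkx as nx
from networkx.algorithms.community import louvain_communities

# G is a graph built by Semnet, as in the quick start above
degree = dict(G.degree())
centrality = nx.degree_centrality(G)

# Community membership as a categorical feature
communities = louvain_communities(G, seed=42)
community_id = {
    node: cid for cid, members in enumerate(communities) for node in members
}

Joined back onto your documents, features like these can feed clustering, reranking, or deduplication steps elsewhere in your pipeline.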

Want to know more?