Tutorial

This tutorial provides a step-by-step guide to using NetworkSampler for space-filling sampling on networks and semi-supervised node classification via label propagation.

Package information

Start by checking the installed version and dependencies:

import networksampler
networksampler.info()

Generating a test network

NetworkSampler includes a utility to generate random networks for testing. For a more realistic scenario, we use NetworkX to generate a network with two communities:

import networkx as nx
import numpy as np

# Generate a network with 1000 nodes and two communities
G = nx.gaussian_random_partition_graph(1000, 500, 100, 0.20, 0.1)
A = nx.adjacency_matrix(G).todense()

print(f"Nodes: {A.shape[0]}")
print(f"Edges: {int(np.sum(A) / 2)}")

Computing the shortest path matrix

The space-filling algorithm operates on geodesic distances between nodes. You can precompute this matrix once and reuse it across multiple sampling runs:

D = networksampler.compute_shortest_length_paths_matrix(A)

print(f"Matrix shape: {D.shape}")
print(f"Max distance: {D.max():.0f}")
print(f"Mean distance: {D.mean():.2f}")

Space-filling sampling

The core function sa_sampling selects a set of seed nodes that are well spread across the network, minimizing the coverage criterion via simulated annealing:

# Sample 30 nodes (3% of 1000)
nodes, cost = networksampler.sa_sampling(A, 30, p=-4, q=4, r=0.1, D=D)

print(f"Sampled nodes: {nodes}")
print(f"Coverage cost: {cost:.4f}")

The parameters p and q control the coverage criterion. The paper recommends p=-4 and q=4 as they generally achieve the best accuracy. The parameter r is the cooling rate for simulated annealing.

If you have a precomputed shortest path matrix, pass it via the D parameter to avoid recomputing it.

For reproducible results, use the random_seed parameter:

nodes, cost = networksampler.sa_sampling(
    A, 30, p=-4, q=4, r=0.1, D=D, random_seed=42
)

Comparing with other sampling strategies

NetworkSampler also provides centrality-based random sampling through node_random_sample. You can use this to compare the space-filling design against other strategies:

# Random sample proportional to degree centrality
nodes_degree = networksampler.node_random_sample(A, 30, measure="degree")

# Random sample proportional to betweenness centrality
nodes_betw = networksampler.node_random_sample(A, 30, measure="betweenness")

# Random sample proportional to closeness centrality
nodes_close = networksampler.node_random_sample(A, 30, measure="closeness")

Available centrality measures are: degree, closeness, betweenness, and eigenvector.

Two-phase sampling for large networks

For networks larger than 5000 nodes, the full shortest path matrix becomes expensive to compute and store. The sa_sampling_twophases function addresses this with a two-phase procedure:

First, a subgraph is sampled using centrality-based node selection
Then, the space-filling design is applied on the subgraph

nodes, cost = networksampler.sa_sampling_twophases(
    A, ns=10, p=-4, q=4, r=0.1,
    centrality_measure="betweenness",
    first_phase_sample_fraction=0.2
)

print(f"Sampled nodes: {nodes}")
print(f"Coverage cost: {cost:.4f}")

The first_phase_sample_fraction controls the fraction of nodes retained in the first phase (default: 0.1). The centrality_measure parameter determines how nodes are selected in the first phase.

Computing node centrality measures

The utility function nodes_centrality_measures computes multiple centrality metrics for all nodes in the network:

from networksampler.utils import nodes_centrality_measures

pagerank, degree, closeness, betweenness = nodes_centrality_measures(A)

# Example: top 5 nodes by betweenness centrality
top5 = sorted(betweenness, key=betweenness.get, reverse=True)[:5]
print(f"Top 5 nodes by betweenness: {top5}")

Label propagation example

Here is a complete example combining NetworkSampler with label propagation:

import networkx as nx
import numpy as np
import networksampler

# 1. Generate a network with two communities
G = nx.gaussian_random_partition_graph(500, 250, 50, 0.15, 0.05)
A = nx.adjacency_matrix(G).todense()

# 2. Assign ground truth labels based on community membership
partition = G.graph['partition']
labels = np.zeros(500, dtype=int)
for i, community in enumerate(partition):
    for node in community:
        labels[node] = i

# 3. Select seeds using space-filling design
seeds, cost = networksampler.sa_sampling(A, 15, p=-4, q=4, r=0.1)

# 4. Label propagation using the graph Laplacian
n = A.shape[0]
W = np.array(A, dtype=float)
D_diag = np.diag(W.sum(axis=1).A1 if hasattr(W.sum(axis=1), 'A1') else W.sum(axis=1))
L = D_diag - W

# Partition into unlabeled (u) and seed (s) nodes
u_mask = np.ones(n, dtype=bool)
u_mask[seeds] = False
u_idx = np.where(u_mask)[0]
s_idx = seeds

Luu = L[np.ix_(u_idx, u_idx)]
Lus = L[np.ix_(u_idx, s_idx)]
fs = labels[s_idx].astype(float)

# Solve for unlabeled nodes
fu = np.linalg.solve(Luu, -Lus @ fs)
predicted = (fu >= 0.5).astype(int)

# 5. Evaluate accuracy
accuracy = np.mean(predicted == labels[u_idx])
print(f"Classification accuracy: {accuracy:.4f}")