Algorithm description

Background

Detecting meaningful clusters, such as protein (super)families and complexes, from large biological networks is critically important in many areas of genomics research. In particular, graph-based approaches have been used successfully for studying the functional and evolutionary relationships between proteins (or genes) on a genome-wide scale. Graphs (networks) provide a framework to model the various relationships between biological entities. For example, nodes and edges might correspond to proteins and sequence-/structure-based similarities, respectively. Given a similarity or distance matrix obtained from all-versus-all protein comparisons (e.g. using the BLAST algorithm), graph algorithms apply user-defined thresholds to group "similar" proteins into protein clusters. Among these algorithms, the standard connected-component search has been widely used for grouping proteins into protein families or orthologous groups owing to its simplicity, scalability and biological soundness [Koonin04].

Graph-based implementation

There are two distinct algorithmic approaches to this problem:
1. The first approach requires the entire graph to be stored in the computer's memory before the clusters are detected using either the depth-first search (DFS) or the breadth-first search (BFS) algorithm [Grimaldi99]. This approach (denoted 'in core' from here on) is, however, memory-expensive, particularly for large graphs of millions of nodes and/or edges (O(E) or O(N^2) space complexity depending on the implementation, where E is the number of edges and N the number of nodes).
2. The second, more memory-efficient approach (denoted 'external memory' from here on) does not require the entire graph to be stored in memory; instead, the clusters are built up gradually while the graph is read from the hard disk, which saves a considerable amount of memory. This can be achieved using the well-known family of UNION-FIND algorithms (UFA) [Tarjan75]. In the netclust program we implemented the asymptotically optimal UFA variant with (nearly) linear time and space complexity (O(E * alpha(E)) worst-case time complexity, where E is the number of edges and alpha the inverse Ackermann function; O(N) space complexity, where N is the number of nodes), enabling the analysis of very large data sets in almost real time. The algorithm as used in netclust involves three abstract operations: (i) populate singletons, (ii) find group memberships, and (iii) merge groups sharing at least one member (a minimal sketch of these operations is given below). The (preliminary) clusters are stored as rooted trees, which are then subjected to two post-processing steps. First, each tree is "compressed" so that all nodes (members) of a tree point directly to the root of that tree. Second, the resulting trees (clusters) are sorted by size and labeled with increasing integer values.
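
The following is a minimal, illustrative C sketch of such a union-find scheme and not the actual netclust source: nodes are assumed to be pre-indexed as integers 0..N-1, a small hard-coded array stands in for the streamed graph file, and only path compression (halving) is shown; the asymptotically optimal variant additionally balances merges by rank or size to reach the O(E * alpha(E)) bound.

    /* Minimal union-find sketch (not the actual netclust source): only the
     * O(N) parent array is kept in memory while edges are streamed in. */
    #include <stdio.h>
    #include <stdlib.h>

    static int *parent;            /* parent[i] == i marks a root (cluster representative) */

    /* (i) populate singletons: every node starts as its own cluster */
    static void make_sets(int n)
    {
        parent = malloc(n * sizeof *parent);
        for (int i = 0; i < n; i++)
            parent[i] = i;
    }

    /* (ii) find group membership, compressing the path to the root on the way */
    static int find(int x)
    {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];   /* path halving */
            x = parent[x];
        }
        return x;
    }

    /* (iii) merge the groups of two nodes that share an edge */
    static void merge(int a, int b)
    {
        int ra = find(a), rb = find(b);
        if (ra != rb)
            parent[ra] = rb;
    }

    int main(void)
    {
        int n = 6;
        make_sets(n);
        /* edges would normally be streamed from the indexed graph file;
         * a few hard-coded pairs stand in for that stream here */
        int edges[][2] = { {0, 1}, {1, 2}, {3, 4} };
        for (size_t i = 0; i < sizeof edges / sizeof edges[0]; i++)
            merge(edges[i][0], edges[i][1]);
        for (int i = 0; i < n; i++)          /* final pass: report node -> cluster root */
            printf("node %d -> cluster %d\n", i, find(i));
        free(parent);
        return 0;
    }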

Data input and output

The netclust program takes an (un)directed weighted graph (network) in sparse-matrix format as input and returns a list of clusters in a structured text file. In addition, the Multi-netclust tool enables multiple data networks to be combined using the kernel fusion (averaging) method proposed by [Kittler98], and then searched for clusters connected in all or in either of the networks. Prior to cluster detection, the input graph must be indexed using the netindex utility to speed up the downstream processing. The user can set a similarity or distance cutoff value, depending on the input matrix, to control the sensitivity and specificity of the results. For example, spurious similarities can be filtered out by choosing an appropriate cutoff value, which usually requires domain knowledge. Generally, a permissive cutoff may produce one large cluster, whereas a strict cutoff may yield many singletons. The steps involved in the Multi-netclust workflow are schematically depicted in Figure 1; a small sketch of the network-combination and cutoff step follows the figure.

Figure 1
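
As an illustration of the combination and cutoff step, the sketch below fuses the weights of a single edge observed in two networks using a sum (averaging) rule and a product rule, and then applies a user-defined cutoff. The function names, the two-network example values and the treatment of edge weights are illustrative assumptions; this is not the actual Multi-netclust code.

    /* Hedged sketch of the network-combination step: given per-network
     * weights of one edge, combine them with the sum (average) or product
     * rule and keep the edge only if the fused weight passes the cutoff. */
    #include <stdio.h>

    /* average of the weights of one edge across k networks (sum rule) */
    static double fuse_sum(const double *w, int k)
    {
        double s = 0.0;
        for (int i = 0; i < k; i++)
            s += w[i];
        return s / k;
    }

    /* product rule: the edge must be supported (non-zero) in every network */
    static double fuse_product(const double *w, int k)
    {
        double p = 1.0;
        for (int i = 0; i < k; i++)
            p *= w[i];
        return p;
    }

    int main(void)
    {
        /* one edge observed in two similarity networks, plus a similarity cutoff */
        double w[2] = { 0.80, 0.40 };
        double cutoff = 0.50;

        double s = fuse_sum(w, 2);      /* 0.60: passes the cutoff -> edge kept    */
        double p = fuse_product(w, 2);  /* 0.32: fails the cutoff  -> edge dropped */

        printf("sum rule:     %.2f (%s)\n", s, s >= cutoff ? "kept" : "dropped");
        printf("product rule: %.2f (%s)\n", p, p >= cutoff ? "kept" : "dropped");
        return 0;
    }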

Performance benchmark

Several artificial and real biological networks (Table 1) were used to compare the run-times and memory usage of several available graph-based tools (Table 2). For each program these values were measured using Perl's Benchmark module and the pmap Unix/Linux program, respectively. In total, eleven networks were constructed using simple Perl scripts and the BLAST program (version 2.2.17): eight artificial networks (denoted T1-4, C1-3 and RN) and three biological networks built from UniProt and/or RefSeq proteins and BLAST similarities (denoted B1-3). Specifically, RN is a random Erdös-Rényi graph [Erdos59], whereas T1-4 and C1-3 are non-random 'thread-like' and 'cliquish' graphs, respectively. The benchmark analysis was conducted on an ordinary desktop computer (Intel Pentium 4 3GHz CPU, 32-bit, 1GB RAM, 160GB SATA hard disk, SUSE Linux 10.0 operating system).
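
For reference, a random network in the spirit of RN can be produced by drawing node pairs uniformly at random and writing them out as a weighted edge list. The sketch below is only an equivalent illustration with hypothetical parameter values and a generic tab-separated node-node-weight line (not necessarily the exact netclust input format); the actual benchmark networks were generated with the authors' own Perl scripts and the BLAST program, and duplicate edges are not removed here.

    /* Illustrative sketch of building an Erdos-Renyi-style random edge list:
     * m node pairs drawn uniformly at random over n candidate nodes. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const int n = 10000;   /* candidate nodes (illustrative value) */
        const int m = 10000;   /* edges to draw   (illustrative value) */

        srand((unsigned) time(NULL));
        for (int i = 0; i < m; i++) {
            int a = rand() % n;
            int b = rand() % n;
            if (a == b) {        /* skip self-loops and redraw the pair */
                i--;
                continue;
            }
            /* generic weighted edge-list line: node, node, weight */
            printf("%d\t%d\t%.1f\n", a, b, 1.0);
        }
        return 0;
    }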

Table 1. Graphs (networks) used in the benchmark.
Graph  N           E            C       CN      CE     File size
T1     10,000      9,999        1       10,000  9,999  116K
T2     100,000     99,990       10      10,000  9,999  1.4M
T3     1,000,000   999,900      100     10,000  9,999  16M
T4     10,000,000  9,999,000    1,000   10,000  9,999  170M
C1     1,000       49,500       10      100     4,950  474K
C2     10,000      495,000      100     100     4,950  5.6M
C3     100,000     4,950,000    1,000   100     4,950  66M
RN     79,083      100,000      1,441   NA      NA     1.4M
B1     178,228     2,745,123    35,397  NA      NA     61M
B2     826,554     166,445,591  33,953  NA      NA     4.6G
B3     2,713,908   781,328,458  41,072  NA      NA     21G
N - total number of nodes, E - total number of edges, C - number of components, CN - number of nodes per component, CE - number of edges per component, NA - not applicable as the size of components varies.


Table 2. Available graph-based software.
Abbr.  Software                                   Source code  Ref.
NET    netclust (1.0)                             C
BCL    BLASTClust, NCBI-BLAST package (2.2.18)    C            [1]
CLM    clmclose, MCL package (1.006)              C            [2]

Results

The benchmark results show that netclust is significantly faster and requires less memory than the other software (Table 3). Besides netclust, only BLASTClust was able to handle the largest data sets tested herein. Although the BLASTClust software is efficient and reliable, its use is limited to BLAST similarity networks, whereas netclust is generally applicable. Moreover, netclust is up to 100 times faster and requires about 56% less memory than BLASTClust.

Table 3. Comparison of available graph-based tools (run-time / memory usage per graph).
Graph  NET                  BCL                  CLM
T1     0.3s / 0.5M          NA                   2.8s / 3.8M
T2     0.6s / 4M            NA                   23.2s / 12.7M
T3     4.7s / 38.3M         NA                   3m 47s / 115.7M
T4     1m 12s / 381.7M      NA                   (32m 29s) / (1G)
C1     0.5s / 0.3M          NA                   0.6s / 2.7M
C2     1.5s / 0.5M          NA                   3.9s / 9.2M
C3     15.1s / 4M           NA                   39s / 81.6M
RN     0.8s / 3.3M          NA                   1m 31s / 11.4M
B1     0.5s(a) / 7.7M       1m 20s(a) / 17.4M    26.8s / 40.5M
B2     2m 13s(a) / 32.3M    3m 42s(a) / 39.7M    (>3G)(c)
B3     4m 47s(a) / 104.6M   17m 40s(a) / 152M    (>3G)(c)
Run-times are given in seconds (s), minutes (m) or hours (h); memory usage is given in megabytes (M) or gigabytes (G). NA - not applicable as BLASTClust can only handle BLAST similarity networks. Results in parentheses '()' were obtained on a dedicated 64-bit computer with 32G RAM. (a) Run-time obtained from precomputed and parsed/indexed BLAST similarity data; (b) run-time not determined due to memory overflow; (c) run-time not determined due to premature termination of the matrix-loading routine.


The scalability of the Multi-netclust tool to multiple networks was tested on the T4 example graph. The results of this analysis are shown in Table 4. The overall run-times roughly double when twice as many networks are analyzed. This is due to the increasing preprocessing time; the indexing and clustering times remain constant over all instances because the combined networks are identical to T4. Similarly, the amount of allocated memory remains constant and is equivalent to the amount allocated for a single T4 network, i.e. 381.7M of RAM (Table 3).

Table 4. Scalability of the Multi-netclust tool to multiple networks.
N Networks Run-time
2 T4 + T4 5m 27s
2 T4 x T4 6m 30s
4 T4 + T4 + T4 + T4 9m 10s
4 T4 x T4 x T4 x T4 10m 52s
8 T4 + T4 + T4 + T4 + T4 + T4 + T4 + T4 16m 26s
8 T4 x T4 x T4 x T4 x T4 x T4 x T4 x T4 18m 50s
Symbols 'x' and '+' refer to the product and sum aggregation rules, respectively.


References

[Koonin04] Koonin, E. V. et al.: 2004, A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes., Genome Biol 5, R7.
[Grimaldi99] Grimaldi, R. P.: 1999, Discrete and Combinatorial Mathematics, 4th ed., Addison-Wesley Longman, Inc.
[Tarjan75] Tarjan, R. E.: 1975, Efficiency of a Good But Not Linear Set Union Algorithm, J ACM 22, 215-225.
[Kittler98] Kittler, J. et al.: 1998, On combining classifiers. IEEE Trans Pattern Anal Mach Intell. 20, 226-239.
[Erdos59] Erdös, P. and Rényi, A.: 1959, On random graphs. Publ Math 6, 290-297.
[1] NCBI BLASTClust
[2] MCL