Apcluster sparse


APCLUSTER uses affinity propagation (Frey and Dueck, Science, 2007) to identify data clusters, using a set of real-valued pair-wise data point similarities as input. Each cluster is represented by a data point called a cluster center, and the method searches for clusters so as to maximize a fitness function called net similarity. The method is iterative and stops after maxits iterations (default of 1000; see below for how to change this value) or when the cluster centers stay constant for convits iterations (default of 100). The command apcluster(s,p,'plot') can be used to plot the net similarity during operation of the algorithm.

For N data points, there may be as many as \(N^2-N\) pair-wise similarities (note that the similarity of data point i to k need not be equal to the similarity of data point k to i). These may be passed to APCLUSTER in an \(N\times N\) matrix s, where s(i,k) is the similarity of point i to point k. In fact, only a smaller number of relevant similarities are needed for APCLUSTER to work. If only M similarity values are known, where \(M < N^2-N\), they can be passed to APCLUSTER in an \(M\times 3\) matrix s, where each row of s contains a pair of data point indices and a corresponding similarity value: s(j,3) is the similarity of data point s(j,1) to data point s(j,2).
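As an illustration, here is a minimal sketch of building the \(M\times 3\) sparse form by keeping only each point's nearest neighbors; the data x, the neighbor count nn, and the negative-squared-distance similarity are our assumptions, not part of APCLUSTER itself:

    N = 200; x = rand(N,2);                       % hypothetical 2-D data points
    nn = 10;                                      % keep 10 nearest neighbors per point
    s = zeros(N*nn, 3); j = 0;
    for i = 1:N
        d = sum((x - repmat(x(i,:), N, 1)).^2, 2);  % squared distances from point i
        d(i) = Inf;                                 % exclude the self-similarity
        [dsort, ord] = sort(d);
        for t = 1:nn                                % similarity = -squared distance
            j = j + 1;
            s(j,:) = [i, ord(t), -dsort(t)];
        end
    end

Each row of s then holds (source index, target index, similarity), exactly the sparse input format described above.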

APCLUSTER automatically determines the number of clusters, based on the input p, which is an \(N\times 1\) matrix of real numbers called preferences. p(i) indicates the preference that data point i be chosen as a cluster center. A good choice is to set all preference values to the median of the similarity values; raising this shared value yields more clusters, and lowering it yields fewer. If p is a scalar, APCLUSTER assumes all preferences are equal to p.
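With the \(M\times 3\) input above, a hedged one-liner for this choice is:

    p = median(s(:,3));   % shared preference: median of the known similarities
                          % lower it (e.g., toward min(s(:,3))) for fewer clusters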

The fitness function (net similarity) used to search for solutions equals the sum of the preferences of the cluster centers plus the sum of the similarities of the other data points to their cluster centers.
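In symbols (writing \(\mathcal{K}\) for the set of cluster centers and \(c(i)\) for the center assigned to point \(i\); this notation is ours, not APCLUSTER's):

\[ \mathrm{netsim} \;=\; \underbrace{\sum_{k\in\mathcal{K}} p(k)}_{\mathrm{expref}} \;+\; \underbrace{\sum_{i\notin\mathcal{K}} s\bigl(i, c(i)\bigr)}_{\mathrm{dpsim}} \]

which matches the decomposition netsim = dpsim + expref in the return values below.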

RETURN VALUES

idx
The identified cluster centers and the assignments of other data points to these centers. idx(j) is the index of the data point that is the cluster center for data point j. If idx(j) equals j, then point j is itself a cluster center.
dpsim
\(\sum\)(similarities of the data points to their cluster centers)
expref
\(\sum\)(preferences of the cluster centers)
netsim
net similarity (= dpsim + expref)
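A small sketch of how idx is typically unpacked (the variable names here are illustrative):

    idx = apcluster(s, p);                 % first return value: assignments
    exemplars = unique(idx);               % indices of the cluster centers
    K = numel(exemplars);                  % number of clusters found
    members = find(idx == exemplars(1));   % points assigned to the first center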

EXAMPLE

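The example source is collapsed in the wiki view. A sketch in the same spirit (random 2-D points, negative squared Euclidean distance as similarity, median preference; the data and sizes are our assumptions, not the original example):

    N = 25; x = rand(N,2);                 % random 2-D data points
    M = N*N - N; s = zeros(M,3); j = 1;    % all N^2-N pair-wise similarities
    for i = 1:N
        for k = [1:i-1, i+1:N]
            s(j,1) = i; s(j,2) = k;
            s(j,3) = -sum((x(i,:) - x(k,:)).^2);   % negative squared distance
            j = j + 1;
        end
    end
    p = median(s(:,3));                    % shared preference (see above)
    idx = apcluster(s, p, 'plot');         % run affinity propagation, plot netsim
    fprintf('Number of clusters: %d\n', numel(unique(idx)));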

PARAMETERS

The following parameters can be set by providing name-value pairs, e.g., apcluster(s,p,'maxits',1000).

maxits
Any positive integer. This specifies the maximum number of iterations performed by affinity propagation. Default: 1000.
convits
Any positive integer. APCLUSTER decides that the algorithm has converged if the estimated cluster centers stay fixed for convits iterations. Increase this value to apply a more stringent convergence test. Default: 100.
dampfact
A real number that is less than 1 and greater than or equal to 0.5. This sets the damping level of the message-passing method, where values close to 1 correspond to heavy damping, which may be needed if oscillations occur. Default: 0.9
plot
No value needed. This creates a figure that plots the net similarity after each iteration of the method. If the net similarity fails to converge, consider increasing the values of dampfact and maxits.
details
No value needed. This causes idx, netsim, dpsim and expref to be stored after each iteration.
nonoise
No value needed. Degenerate input similarities (e.g., where the similarity of i to k equals the similarity of k to i) can prevent convergence. To avoid this, APCLUSTER adds a small amount of noise to the input similarities. This flag turns off the addition of noise.
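A hedged example combining several of these options in one call (assuming the return order idx, netsim, dpsim, expref of the reference implementation):

    % heavier damping and larger iteration budgets for a problem that oscillates
    [idx, netsim, dpsim, expref] = apcluster(s, p, ...
        'dampfact', 0.95, 'maxits', 2000, 'convits', 200, 'plot');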

Copyright © Brendan J. Frey and Delbert Dueck (2006). This software may be freely used and distributed for non-commercial purposes.


The whole index-building part of the source is a tricky manipulation done for efficiency. The basic idea is to compute using only the similarities that are actually known. For example, if you create data as follows and call the function,

Example

the intermediate results are as follows.

Result

ind1 and ind2 are each roughly an inverse permutation of s(:,1) and s(:,2). For example, above, ind2 begins 46, 1, 47, 2, 10, 48, ..., and the entries of s(:,2) at positions 46, 1, 47, 2, 10, 48, ... are 1, 2, 2, 3, 3, 3, ...

ind1s, ind1e, ind2s, and ind2e are indices marking where each value's block starts and ends inside ind1 and ind2. For example, taking ind1 over the range [ind1s(k), ind1e(k)] yields the indices into s that correspond to the value k. A sketch of how these arrays could be built is given below.
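Since the original source is collapsed above, this is only a hypothetical reconstruction of how such index arrays could be derived from the \(M\times 3\) similarity list s, using MATLAB's stable sort and accumarray:

    M = size(s,1);
    N = max(max(s(:,1:2)));
    [~, ind1] = sort(s(:,1));              % row indices of s, grouped by s(:,1)
    [~, ind2] = sort(s(:,2));              % row indices of s, grouped by s(:,2)
    cnt1 = accumarray(s(:,1), 1, [N 1]);   % rows mentioning each point in column 1
    ind1e = cumsum(cnt1);                  % end of each block within ind1
    ind1s = [1; ind1e(1:end-1) + 1];       % start of each block within ind1
    cnt2 = accumarray(s(:,2), 1, [N 1]);
    ind2e = cumsum(cnt2);
    ind2s = [1; ind2e(1:end-1) + 1];
    % Now ind1(ind1s(k):ind1e(k)) lists every row j with s(j,1) == k,
    % i.e., all known similarities from point k; ind2 does the same for column 2.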

There is some computational convenience, but the gain does not look large, and as the data grows this layout becomes even less suitable for parallelization.

The biggest problem is that there is no memory saving at all.