5.4. SimRank
!"#
5.4 SimRank
SimRank is a general similarity measure proposed by Glen Jeh and Jennifer Widom [16] in
the eighth ACM SIGKDD conference. The measure is built upon a simple and intuitive graph-
theoretic model, and applied to any domain with object-to-object relationships. The purpose
of SimRank is to measure similarity between objects on the structural context. The intuition
behind SimRank is that two objects are similar if they are related to similar objects and objects are
similar to themselves. Many applications require a measure of similarity between objects, e.g. the
find-similar-document query in search engine, collaborative fil tering in a recommender system, in
which similar users and items are grouped based on the users’ preferences.
Let us model objects and relationships as a directed graph G =(V, E), where nodes in V
represent objects in the domain and edges in E represent relationships between objects. In Web
pages or scientific papers, which are homogeneous domains, nodes represent documents, and a
directed edge from p to q corresponds to a reference (hyperlink or citation) from document p to
document q. In a user-item bipartite domain, both users and items are represented with nodes in
V. A directed edge from p to q corresponds to a purchase (or other expression of preference) of
item q by user p. The result in this case leads to a bipartite graph, with users and items on either
side. Given a node v in the directed graph, I(v) and O(v) denote the set of in-neighbors and
out-neighbors of v, respectively. Individual in-neighbors are denoted as I
i
(v), for 1 ≤ i ≤ |I(v)|,
and individual out-neighbors are denoted as O
i
(v), for 1 ≤ i ≤ |O(v)|.
Let s(a, b) ∈ [0, 1] be the similarity between objects a and b , SimRank defines the measure
as the average similarity between in-neighbors of a and in-neighbors of b. It can be written in a
recursive equation as follows
s(a, b)=
C
|I(a)||I(b)|
|I(a)|
"
i=1
|I(b)|
"
j=1
s(I
i
(a),I
j
(b)) (5.10)
where C is a decay factor between 0 and 1, and gives the rate of decay as similarity flows across
edges. We realize that either a or b may not have any in-neighbors. In this case, the similarity
is set to zero, i.e. s(a, b)=0. Therefore, the summation part is defined to be 0 when I(a)=∅
or I(b)=∅. The recursive nature of SimRank resembles that of PageRank [12] to compute
importance scores for web pages.
In the bipartite domain of recommendation system, SimRank says people are similar if they
purchase similar items, and items are similar if they are purchased by similar people. Therefore, we can
measure the similarity between users and items. Let A and B be two users, their similarity can
$%"&'#(
43
)!*+",$