Zhang-Shasha: Tree edit distance in Python

The zss module provides a function (zss.distance) that computes the edit distance between the two given trees, as well as a small set of utilities to make its use convenient.

If you'd like to learn more about how it works, see References below.

Brought to you by Tim Henderson (tim.tadh@gmail.com) and Steve Johnson (steve@steveasleep.com).

Read the full documentation for more information.

Installation

You can get zss and its soft requirements (editdist and numpy >= 1.7) from PyPI:

pip install zss

Both modules are optional. editdist uses string edit distance to compare node labels rather than a simple equal/not-equal check, and numpy significantly speeds up the library. The only reason version 1.7 of numpy is required is that earlier versions have trouble installing on current versions of Mac OS X.
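
To make the difference between the two kinds of label comparison concrete, here is a minimal pure-Python sketch of string edit distance (Levenshtein distance) next to the plain equal/not-equal check. Both function names are illustrative assumptions; neither is part of zss or editdist.

```python
def equal_or_not(a, b):
    """Simple check: 0 if the labels match exactly, otherwise 1."""
    return 0 if a == b else 1


def levenshtein(a, b):
    """Number of single-character insertions, deletions, and substitutions
    needed to turn string a into string b."""
    # previous[j] holds the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


print(equal_or_not("kitten", "sitting"))  # 1
print(levenshtein("kitten", "sitting"))   # 3
```

The equal/not-equal check treats every mismatch the same, while the edit distance grows with how different the two labels are.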

You can install zss from the source code without dependencies in the usual way:

python setup.py install

If you want to build the docs, you'll need to install Sphinx >= 1.0.

Usage

To compute the distance between two trees, you need:

  1. A tree.
  2. Another tree.
  3. A node-node distance function. By default, zss compares the edit distance between the nodes' labels. zss currently only knows how to handle nodes with string labels.
  4. Functions to let zss.distance traverse your tree.

Here is an example using the library's built-in default node structure and edit distance function:

from zss import simple_distance, Node

A = (
    Node("f")
        .addkid(Node("a")
            .addkid(Node("h"))
            .addkid(Node("c")
                .addkid(Node("l"))))
        .addkid(Node("e"))
    )
B = (
    Node("f")
        .addkid(Node("a")
            .addkid(Node("d"))
            .addkid(Node("c")
                .addkid(Node("b"))))
        .addkid(Node("e"))
    )
assert simple_distance(A, B) == 2
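
To see where the value 2 comes from, the sketch below recomputes it with a naive memoized recursion over forests, assuming unit costs for insert, delete, and relabel. This is not the Zhang-Shasha algorithm and not how zss is implemented; it is only a small reference for checking results on tiny trees. The names `tree`, `forest_distance`, and `tree_distance` are hypothetical, and trees are plain `(label, children)` tuples rather than zss Node objects.

```python
from functools import lru_cache


def tree(label, *kids):
    """Build a hashable tree as a (label, children-tuple) pair."""
    return (label, kids)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not g:
        # Delete the rightmost root of f; its children join the forest.
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1
    if not f:
        # Insert the rightmost root of g.
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + v_kids, g) + 1
    insert = forest_distance(f, g[:-1] + w_kids) + 1
    # Map the two rightmost roots onto each other (free if labels agree),
    # then solve their child forests and the remaining forests separately.
    relabel = (forest_distance(v_kids, w_kids)
               + forest_distance(f[:-1], g[:-1])
               + (v_label != w_label))
    return min(delete, insert, relabel)


def tree_distance(a, b):
    """Edit distance between two single trees."""
    return forest_distance((a,), (b,))


A = tree("f", tree("a", tree("h"), tree("c", tree("l"))), tree("e"))
B = tree("f", tree("a", tree("d"), tree("c", tree("b"))), tree("e"))
print(tree_distance(A, B))  # 2: relabel h to d, and l to b
```

The recursion revisits many overlapping sub-forests, so it is only practical for small inputs.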

Specifying Custom Tree Formats

Specifying custom tree formats and distance metrics is easy. The zss.simple_distance function takes three extra parameters besides the two trees to compare:

  1. get_children - a function to retrieve a list of children from a node.
  2. get_label - a function to retrieve the label object from a node.
  3. label_dist - a function to compute the non-negative integer distance between two node labels.

Example

#!/usr/bin/env python

import zss

try:
    from editdist import distance as strdist
except ImportError:
    def strdist(a, b):
        if a == b:
            return 0
        else:
            return 1

def weird_dist(A, B):
    return 10*strdist(A, B)

class WeirdNode(object):

    def __init__(self, label):
        self.my_label = label
        self.my_children = list()

    @staticmethod
    def get_children(node):
        return node.my_children

    @staticmethod
    def get_label(node):
        return node.my_label

    def addkid(self, node, before=False):
        if before:
            self.my_children.insert(0, node)
        else:
            self.my_children.append(node)
        return self

A = (
    WeirdNode("f")
        .addkid(WeirdNode("d")
            .addkid(WeirdNode("a"))
            .addkid(WeirdNode("c")
                .addkid(WeirdNode("b"))))
        .addkid(WeirdNode("e"))
    )
B = (
    WeirdNode("f")
        .addkid(WeirdNode("c")
            .addkid(WeirdNode("d")
                .addkid(WeirdNode("a"))
                .addkid(WeirdNode("b"))))
        .addkid(WeirdNode("e"))
    )

dist = zss.simple_distance(
    A, B, WeirdNode.get_children, WeirdNode.get_label, weird_dist)

print(dist)
assert dist == 20
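
The same three callables can also be written as plain functions when the tree is stored in some other shape. The sketch below assumes a hypothetical format in which every node is a dict with a "name" key and an optional "items" key; the `postorder_labels` helper is likewise hypothetical and only exercises the accessors, without calling zss.

```python
def get_children(node):
    """Return the node's children as a list (empty for a leaf)."""
    return node.get("items", [])


def get_label(node):
    """Return the label stored on the node."""
    return node["name"]


def label_dist(a, b):
    """Non-negative integer cost of turning label a into label b."""
    return 0 if a == b else 1


def postorder_labels(root):
    """Visit children before their parent, using only the two accessors."""
    labels = []
    for child in get_children(root):
        labels.extend(postorder_labels(child))
    labels.append(get_label(root))
    return labels


# Hypothetical dict-based tree with the same shape as tree A above.
tree_a = {"name": "f", "items": [
    {"name": "d", "items": [
        {"name": "a"},
        {"name": "c", "items": [{"name": "b"}]},
    ]},
    {"name": "e"},
]}

print(postorder_labels(tree_a))  # ['a', 'b', 'c', 'd', 'e', 'f']
```

With zss installed, these functions would be passed in the same positions as in the example above, that is, as the third, fourth, and fifth arguments to zss.simple_distance.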

References

The algorithm used by zss is taken directly from the original paper by Zhang and Shasha. If you would like to discuss the paper or the tree edit distance problem (we have implemented a few other algorithms as well), please email the authors.

approxlib by Dr. Nikolaus Augsten contains a good Java implementation of Zhang-Shasha as well as a number of other useful tree distance algorithms.

Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18:1245–1262, 1989. (the original paper)

Slide deck overview of Zhang-Shasha

Another paper describing Zhang-Shasha


