小雷音寺

Wednesday, September 11, 2013

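OCF Developer's Guide, Chapter 11

11 Installing and packaging resource agents

This chapter covers what to do once coding and testing are done: where to install a resource agent, and how to ship it in your own application package or in the Linux-HA resource agents repository.

11.1 Installing resource agents

If you ship a resource agent as part of your own project, it must be installed in the correct location. Resource agents belong in a subdirectory of /usr/lib/ocf/resource.d/ named after the provider, which is the name of your project or any other name you choose to identify the agent with.

For example, if your foobar resource agent is packaged as part of a project named fortytwo, its full path is /usr/lib/ocf/resource.d/fortytwo/foobar. Make sure the resource agent has 755 (-rwxr-xr-x) permissions.

Once installed this way, OCF-compliant cluster resource managers can locate, parse, and execute the resource agent. The Pacemaker cluster manager, for example, maps the path above to the resource agent identifier ocf:fortytwo:foobar.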
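The mapping from install path to identifier can be sketched in a few lines of shell. This is only an illustration: ra_path_to_id is a hypothetical helper, not part of any OCF library, and the install command in the comment assumes GNU install.

```shell
# Hypothetical helper: derive the ocf:<provider>:<agent> identifier
# from the path of an installed resource agent.
ra_path_to_id() {
    local path provider agent
    path="$1"
    agent="${path##*/}"        # last path component: the agent name
    provider="${path%/*}"      # drop the agent name ...
    provider="${provider##*/}" # ... and keep the directory above it
    printf 'ocf:%s:%s\n' "$provider" "$agent"
}

# A typical install step could look like this (run as root):
#   install -D -m 0755 foobar /usr/lib/ocf/resource.d/fortytwo/foobar
ra_path_to_id /usr/lib/ocf/resource.d/fortytwo/foobar
```

For the path used in this section, the helper prints ocf:fortytwo:foobar.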

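11.2 Packaging resource agents

If the resource agent ships as part of your own project, keep the points in this section in mind.
Note: to submit the resource agent to the Linux-HA resource agents repository instead, see Section 11.3, Submitting resource agents.

11.2.1 RPM packaging

When packaging as RPM, the recommended approach is a sub-package whose name ends in -resource-agents. Make sure the package owns its provider directory and depends on the upstream resource-agents package, which provides the parent directory hierarchy and the shell function library. An example RPM spec snippet: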

 
%package resource-agents
Summary: OCF resource agent for Foobar
Group: System Environment/Base
Requires: %{name} = %{version}-%{release}, resource-agents

%description resource-agents
This package contains the OCF-compliant resource agents for Foobar.

%files resource-agents
%defattr(755,root,root,-)
%dir %{_prefix}/lib/ocf/resource.d/fortytwo
%{_prefix}/lib/ocf/resource.d/fortytwo/foobar

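Note: when an RPM spec file contains a %package declaration, RPM treats it as a sub-package, which inherits top-level fields such as Name, Version, and License. A sub-package's name is the top-level package name followed by the sub-package's own name, so the snippet above defines a sub-package named foobar-resource-agents (assuming the top-level package is named foobar).

11.2.2 Debian packaging

For Debian packages, as with RPM, the recommended approach is a separate package that holds your resource agents and depends on the cluster-agents package.

Note: this section assumes that the package is built with debhelper.

An example debian/control snippet: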

Package: foobar-cluster-agents
Priority: extra
Architecture: all
Depends: cluster-agents
Description: OCF-compliant resource agents for Foobar

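You can also create a separate .install file. Continuing the example of installing the foobar resource agent as a sub-package of fortytwo, the debian/fortytwo-cluster-agents.install file would contain the following: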

usr/lib/ocf/resource.d/fortytwo/foobar

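11.3 Submitting resource agents

If you would rather submit the resource agent to the upstream resource agents repository (the ClusterLabs repository on GitHub) than ship it with your own project, follow the steps below.

First, create a working copy (a Git clone) of the upstream repository: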

git clone git://github.com/ClusterLabs/resource-agents

Then copy your resource agent into the heartbeat subdirectory:

cd resource-agents/heartbeat
cp /path/to/your/local/copy/of/foobar .
chmod 0755 foobar
cd ..

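Next, edit Makefile.am in the resource-agents/heartbeat directory and add your resource agent to the ocf_SCRIPTS list. This ensures that the agent is installed in the correct location.

Finally, open Makefile.am in the resource-agents/doc/man directory and add the agent's man page, named ocf_heartbeat_ followed by the agent name and the .7 suffix (ocf_heartbeat_foobar.7 for this example), to the man_MANS variable. The man page is then generated automatically from the agent's metadata and installed in the correct location.

Now add the new resource agent and the two modified Makefiles, then commit: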

git add heartbeat/foobar
git add heartbeat/Makefile.am
git add doc/man/Makefile.am
git commit

Give the commit a meaningful message. The change can then be mailed to the linux-ha-dev mailing list, for example:

git send-email --to=linux-ha-dev@lists.linux-ha.org

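It is a good idea to introduce the patch set on the mailing list with a short announcement.

git send-email formats the change as a well-formed patch email and sends it to the list. See the git-send-email man page for details.

Once the submission is accepted, an upstream developer pushes your patch to the upstream repository. At that point you can drop your local patch set and update straight from upstream: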

git reset --hard origin/master
git pull

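11.4 Maintaining resource agents

If you maintain a specific resource agent, or contribute repeatedly to some part of the code base, it is best to keep your own fork on GitHub.

To do so:

Create a GitHub account (if you do not have one yet)
Fork the resource-agents repository
Clone your fork into a local working copy

While developing, commit early and commit often. You can always squash the commits later with git rebase -i.

When you have commits that you would like others to see and review, push them to your GitHub fork and send an email to the linux-ha-dev mailing list pointing people to it.

Once the review is done and the requested changes are made, issue a pull request. There are two ways to do this:

Use the git request-pull utility to generate a pre-formatted summary of your changes. Add whatever information you need and send it to the mailing list. Prefix the subject line with [GIT PULL] so that upstream developers can spot it easily.
Alternatively, open the pull request directly on GitHub, which notifies the upstream developers by email automatically. See the GitHub help pages for details.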

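OCF Developer's Guide, Chapter 10

10 Testing resource agents

This chapter covers automated testing of resource agents. Testing is a critical part of development, both for new resource agents and for changes to existing ones.

10.1 Testing with ocf-tester

The resource agents repository (and every installed resource agents package) ships a utility named ocf-tester. This shell script is a convenient way to test your resource agent.

ocf-tester is normally invoked as root, like this:

ocf-tester -n <name> [-o <param>=<value> ... ] <resource agent>

<name> is an arbitrary resource name
Any number of parameters can be set as <param>=<value> pairs with the -o option
<resource agent> is the full path to your resource agent

When invoked, ocf-tester executes all mandatory actions (see Chapter 5, Resource agent actions).

It also tests optional actions. An optional action must behave as advertised when it is implemented. When it is not implemented, ocf-tester only reports that fact and does not count it as a failure, as the sample run in this section shows.

Important: ocf-tester does not do "dry runs" of actions, and it does not create mock resources. It exercises the real resource agent as-is, whether that means opening and closing a database, mounting a file system, or starting and stopping a virtual machine. Use it with care.

For example, ocf-tester can be run against the foobar resource agent like this: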

# ocf-tester -n foobartest \
             -o superfrobnicate=true \
             -o datadir=/tmp \
             /home/johndoe/ra-dev/foobar
Beginning tests for /home/johndoe/ra-dev/foobar...
* Your agent does not support the notify action (optional)
* Your agent does not support the reload action (optional)
/home/johndoe/ra-dev/foobar passed all tests

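10.2 Testing with ocft

ocft is another testing tool for resource agents. Unlike ocf-tester, ocft can set up complex test environments automatically, including package installation and arbitrary shell scripts.

10.2.1 ocft components

ocft consists of the following components:

A test case generator (/usr/sbin/ocft), which produces shell scripts from test case configuration files
Configuration files (/usr/share/resource-agents/ocft/configs); each one holds the environment settings and test case definitions for one resource agent
Generated test scripts, stored in /var/lib/resource-agents/ocft/cases/; you normally do not need to touch them

10.2.2 Customizing the test environment

ocft changes the runtime environment of the resource agent either by setting environment variables (the interface defined by OCF) or by running ad-hoc shell scripts, which can for instance change permissions or unmount a file system.

10.2.3 How to test

You need to know the software under test. Sketch out every run scenario of interest, listing the expected and unexpected conditions and how the resource agent should behave under each. Then encode those conditions and expected results as ocft test cases. After that, running ocft is simple: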

# ocft make <RA>
# ocft test <RA>

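The first command generates the test case scripts; the second runs them and checks the output.

10.2.4 ocft configuration file syntax

There are four top-level options, each with one or more sub-options.

CONFIG (top-level option)

This option is global and affects every test case.

AgentRoot (sub-option)

AgentRoot /usr/lib/ocf/resource.d/xxx

By default, ocft assumes that the resource agent script lives in the heartbeat provider directory. Use AgentRoot to test an agent installed under a different directory.

InstallPackage (sub-option)

InstallPackage package [package2 [...]]

Packages that the test depends on. A package that is already installed is not installed again.

HangTimeout (sub-option)

HangTimeout secs

The maximum time a resource agent action may run. An action that exceeds it is treated as failed.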

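SETUP-AGENT (top-level option)

SETUP-AGENT
  bash commands

If the resource agent needs initialization before testing, put the bash code here. The initialization runs only once; to run it again, remove the marker file /tmp/.[AGENT_NAME]_set.

CASE (top-level option)

CASE "description"

This is the main building block of a test suite. Each test case is described in one CASE option. A case consists of a number of sub-options, normally ending with a RunAgent sub-option.

Var (sub-option)

Var VARIABLE=value

Sets an environment variable for the resource agent, usually of the form OCF_RESKEY_xxx. Note that there must be no spaces around the "=" sign.

Unvar (sub-option)

Unvar VARIABLE [VARIABLE2 [...]]

Removes one or more environment variables.

Include (sub-option)

Include macro_name

Includes the macro named macro_name. See CASE-BLOCK below.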

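Bash (sub-option)

Bash bash_codes

Sets up the operating system environment: insert bash code here to customize the system. Take care not to cause irreversible changes.

BashAtExit (sub-option)

BashAtExit bash_codes

Restores the operating system environment so that the next test case starts from a clean state. A plain Bash option can do the restoring too, but if an error occurs the script exits immediately without reaching that code. BashAtExit covers this case, because its code runs before the script exits.

RunAgent (sub-option)

RunAgent cmd [ret_value]

Runs the resource agent. cmd is the action passed to the agent, such as start, status, or stop. The optional second argument is the expected return value; it is compared with the agent's actual return value, and a mismatch points to a bug.

Sub-options can also run remotely instead of locally. The protocol is ssh, and the command runs in the background. Append @ and the remote host's address to the sub-option name, for example:

Bash@192.168.1.100 date

The example above runs date on the remote host. Remote commands always run in the background.

NB: this needs more explanation. (Translator's note: this remark is addressed to the authors of the guide.)

CASE-BLOCK (top-level option)

The CASE-BLOCK option defines a macro that can be included in any CASE. All CASE sub-options can be used in it.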

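Saturday, September 07, 2013

OCF Developer's Guide, Chapter 9

9 Special considerations

9.1 Licensing

Developers are encouraged to license resource agents under the GNU General Public License (GPL), version 2 or later, whenever possible. The shell function library does not impose this: it is licensed under the GNU Lesser General Public License (LGPL), version 2.1 or later, so that non-GPL resource agents can use it.

A resource agent must state its license explicitly in its source code.

9.2 Locale settings

Once ocf-shellfuncs has been sourced (see Section 4.3, Initialization), the resource agent automatically has LANG and LC_ALL set to the C locale. A resource agent can therefore rely on always running in the C locale and never needs to reset LANG or any of the LC_ environment variables.

9.3 Testing for running processes

A common way to test whether a particular process (with a known process ID) is running is to send it signal 0 and catch the error, for example: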

if kill -s 0 `cat $daemon_pid_file`; then
    ocf_log debug "Process is currently running"
else
    ocf_log warn "Process is dead, removing pid file"
    rm -f $daemon_pid_file
fi

Important: a far better approach is to test the daemon's functionality with a client of that daemon, as in the monitor example in Section 5.3, monitor action.
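The signal 0 test can be wrapped so that a missing or corrupt PID file is handled as well. The following is a minimal sketch; pid_file_is_alive is a hypothetical helper, not part of the shell function library.

```shell
# Hypothetical helper: succeed only if the PID file names a live process.
pid_file_is_alive() {
    local pid_file pid
    pid_file="$1"
    # A missing or unreadable PID file means "not running".
    [ -r "$pid_file" ] || return 1
    pid=$(cat "$pid_file")
    # Reject empty or non-numeric content before handing it to kill.
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Signal 0 delivers nothing; it only checks that the process exists
    # and that we are allowed to signal it.
    kill -s 0 "$pid" 2>/dev/null
}

# Usage: the current shell is certainly alive.
pid_file=$(mktemp)
echo "$$" > "$pid_file"
if pid_file_is_alive "$pid_file"; then
    echo "Process is currently running"
fi
rm -f "$pid_file"
```

Note that this only proves that some process with that PID exists, which is one more reason to prefer a functional test through a client.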

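9.4 Specifying a master preference

Stateful (master/slave) resources must set their own master preference. It gives the cluster manager the information it needs to decide which node is the best candidate for promotion to the Master role.

Important: it is acceptable for several instances to have the same master preference; in that case the cluster resource manager picks one instance to promote automatically. If all instances have the default master score of 0, however, the cluster manager promotes no instance at all. It is therefore crucial that at least one instance has a positive master score.

crm_master is the convenient tool for this purpose. It wraps crm_attribute and sets the node attribute master-$OCF_RESOURCE_INSTANCE to a given value on the node where it runs. The cluster manager translates the attribute into a promotion score for the corresponding instance and bases its promotion decision on that score.

Stateful resource agents normally call crm_master from the monitor and notify actions.

The following example assumes that the foobar resource agent can test the application's state by running a binary whose exit code depends on whether

the resource is in the master role, or is a slave that has fully caught up with the master, or
the resource is a slave that lags behind the master because of asynchronous replication, or
the resource has stopped gracefully, or
the resource has failed unexpectedly.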

foobar_monitor() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    ocf_run frobnicate --test

    # This example assumes the following exit code convention
    # for frobnicate:
    # 0: running, and fully caught up with master
    # 1: gracefully stopped
    # 2: running, but lagging behind master
    # any other: error
    case "$?" in
        0)
            rc=$OCF_SUCCESS
            ocf_log debug "Resource is running"
            # Set a high master preference. The current master
            # will always get this, plus 1. Any current slaves
            # will get a high preference so that if the master
            # fails, they are next in line to take over.
            crm_master -l reboot -v 100
            ;;
        1)
            rc=$OCF_NOT_RUNNING
            ocf_log debug "Resource is not running"
            # Remove the master preference for this node
            crm_master -l reboot -D
            ;;
        2)
            rc=$OCF_SUCCESS
            ocf_log debug "Resource is lagging behind master"
            # Set a low master preference: if the master fails
            # right now, and there is another slave that does
            # not lag behind the master, its higher master
            # preference will win and that slave will become
            # the new master
            crm_master -l reboot -v 5
            ;;
        *)
            ocf_log err "Resource has failed"
            exit $OCF_ERR_GENERIC
    esac

    return $rc
}

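Thursday, September 05, 2013

OCF Developer's Guide, Chapter 8

8 Conventions

This chapter collects conventions that have emerged in the resource agents repository over the years. Authors are not required to follow them, but doing so makes an agent easier to understand and to review (the principle of least surprise).

8.1 Well-known parameter names

Several parameter names are supported by many resource agents. New resource agents should use them as well:

binary ---- the executable that manages the resource, such as a server daemon
config ---- full path to the configuration file
pid ---- full path to the file that holds the process ID (PID)
log ---- full path to the log file
socket ---- full path to the UNIX socket
ip ---- the IP address that the daemon binds to
port ---- the TCP or UDP port that the daemon binds to

8.2 Parameter defaults

Defaults for resource agent parameters are normally set through variables named with a _default suffix: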

 
# Defaults
OCF_RESKEY_superfrobnicate_default=0

: ${OCF_RESKEY_superfrobnicate=${OCF_RESKEY_superfrobnicate_default}}

Note: the resource agent should also advertise, in its metadata, a default value for every parameter that is not marked as required.
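The default assignment relies on the shell's ${VAR=default} expansion, which assigns only when the variable is unset. A small sketch of its behavior follows; apply_superfrobnicate_default is a hypothetical wrapper added here purely so the effect can be observed.

```shell
# Hypothetical wrapper around the defaulting idiom from this section.
apply_superfrobnicate_default() {
    OCF_RESKEY_superfrobnicate_default=0
    # ${VAR=word} assigns word only if VAR is unset; a value supplied
    # by the cluster manager (even an empty one) is left untouched.
    : ${OCF_RESKEY_superfrobnicate=${OCF_RESKEY_superfrobnicate_default}}
    echo "$OCF_RESKEY_superfrobnicate"
}

unset OCF_RESKEY_superfrobnicate
apply_superfrobnicate_default    # parameter not configured: default applies

OCF_RESKEY_superfrobnicate=1
apply_superfrobnicate_default    # parameter configured: value is kept
```

The first call prints 0 and the second prints 1. If an empty value should also fall back to the default, the ${VAR:=default} form (with a colon) would be the variant to use.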

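8.3 Honoring $PATH when locating binaries

When a resource agent takes a parameter that names an executable (a daemon, or a client used to query status), that executable should be found through the directories listed in $PATH. Do not hard-code a full path. Implement it like this: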

 
# Good example -- do it this way
OCF_RESKEY_frobnicate_default="frobnicate"
: ${OCF_RESKEY_frobnicate="${OCF_RESKEY_frobnicate_default}"}

The following implementation, by contrast, is discouraged:

 
# Bad example -- avoid if you can
OCF_RESKEY_frobnicate_default="/usr/local/sbin/frobnicate"
: ${OCF_RESKEY_frobnicate="${OCF_RESKEY_frobnicate_default}"}

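This rule applies to default values as well.

OCF Developer's Guide, Chapter 7

7 Convenience functions

7.1 Logging: ocf_log

Resource agents should use the ocf_log function for logging. It is called with a severity level followed by the message text.

The following severity levels are supported:

debug ---- debugging messages. Most logging configurations suppress this level by default.
info ---- informational messages about the agent's behavior or status.
warn ---- warnings, for unexpected behavior that is not an unrecoverable error.
err ---- errors. As a rule, log at this level only right before the agent exits with an error code.
crit ---- critical errors. As with err, do not use this level unless the agent also exits with an error code. Very rarely used.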
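A few calls at different severity levels might look as follows. The ocf_log defined here is only a stand-in so that the snippet runs outside a cluster node; its output format is an assumption, and a real agent gets ocf_log from the shell function library instead.

```shell
# Stand-in logger, defined only when the real ocf_log is not available.
if ! command -v ocf_log >/dev/null 2>&1; then
    ocf_log() {
        local level
        level="$1"
        shift
        # Write "<level>: <message>" to standard error.
        echo "$level: $*" >&2
    }
fi

ocf_log debug "Checking frobnicate state"
ocf_log info "Resource started"
ocf_log warn "Configuration file is world-readable"
ocf_log err "frobnicate failed, giving up"
```

The message texts are made up for illustration; only the calling form (level first, then the message) is taken from the examples in this guide.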

7.2 Testing for binaries: have_binary and check_binary

A resource agent may need to test whether a particular executable is available. The have_binary function does this:

if ! have_binary frobnicate; then
   ocf_log warn "Missing frobnicate binary, frobnication disabled!"
fi

If a missing binary is a fatal problem for the resource, call check_binary up front:

check_binary frobnicate

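check_binary is a convenient way to test that a binary exists and is executable. If the binary is not found or not executable, check_binary exits the agent with $OCF_ERR_INSTALLED.

Note: have_binary and check_binary both search the directories in $PATH when the binary is not given with a full path. It is therefore better not to test with a full path, because install locations vary with the distribution and with user policy.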
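To make the difference between the two checks concrete, here is a rough approximation built on the POSIX command -v lookup. The _sketch functions are hypothetical stand-ins and do not reproduce the library's actual implementation.

```shell
# Hypothetical stand-in for have_binary: succeed if the name resolves
# through $PATH (command -v also matches shell builtins and functions).
have_binary_sketch() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical stand-in for check_binary: a missing binary is fatal.
check_binary_sketch() {
    if ! have_binary_sketch "$1"; then
        echo "err: setup problem: could not find command: $1" >&2
        exit 5   # numeric value of $OCF_ERR_INSTALLED
    fi
}

# Non-fatal test: the agent carries on with reduced functionality.
if ! have_binary_sketch frobnicate; then
    echo "warn: Missing frobnicate binary, frobnication disabled!" >&2
fi

# Fatal test: sh is expected to exist, so the script continues.
check_binary_sketch sh
```

Because check_binary_sketch calls exit, it ends the whole agent rather than just returning to its caller, which matches the behavior described above for check_binary.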

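7.3 Running commands and capturing their output: ocf_run

Whenever a resource agent needs to run a command and capture its output, it can use ocf_run, as in this example: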

ocf_run "frobnicate --spam=eggs" || exit $OCF_ERR_GENERIC

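With the command above, the resource agent runs frobnicate --spam=eggs and captures both its output and its exit code. If the exit code is non-zero (an error), ocf_run logs the output at the err level and the agent then exits. If the exit code is zero (success), the output is logged at the info level.

To ignore the output of a successful run, use the -q flag. In the following example, ocf_run logs output only when the command exits with a non-zero code.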

ocf_run -q "frobnicate --spam=eggs" || exit $OCF_ERR_GENERIC

Finally, to log the output of a failed command at a level other than err, use the -info or -warn flag:

ocf_run -warn "frobnicate --spam=eggs"
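The capture-and-log behavior described in this section can be approximated as follows. run_and_log is a hypothetical stand-in that writes to standard error instead of the cluster log, takes the command as separate arguments, and has no -q, -info, or -warn handling.

```shell
# Hypothetical stand-in for the core of ocf_run.
run_and_log() {
    local output rc
    # Capture stdout and stderr of the command, then its exit code.
    output=$("$@" 2>&1)
    rc=$?
    if [ "$rc" -eq 0 ]; then
        # Success: log any output at the info level.
        [ -n "$output" ] && echo "info: $output" >&2
    else
        # Failure: log the command, its exit code and its output as err.
        echo "err: command failed: $* (rc=$rc): $output" >&2
    fi
    # Hand the original exit code back so callers can use "|| exit ...".
    return "$rc"
}

run_and_log echo "frobnication complete"
run_and_log sh -c 'echo "disk full"; exit 3' || echo "caller sees exit code $?"
```

Returning the command's own exit code is what makes the `|| exit $OCF_ERR_GENERIC` pattern from the examples above work.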

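7.4 Locks: ocf_take_lock and ocf_release_lock_on_exit

Occasionally, different resources of the same type must not run certain actions in parallel, depending on the cluster configuration. The resource agent then has to make sure that they do not run concurrently on the same machine: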

LOCKFILE=${HA_RSCTMP}/foobar
ocf_release_lock_on_exit $LOCKFILE

foobar_start() {
    ...
    ocf_take_lock $LOCKFILE
    ...
}

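ocf_take_lock tries to obtain the given $LOCKFILE. If the lock is not available, it sleeps for a random time between 0 and 1 seconds and retries. ocf_release_lock_on_exit releases the lock file when the agent exits.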
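The take-and-retry idea can be illustrated with a plain lock file. This sketch is not how the library implements its locks: take_lock_sketch and release_lock_sketch are hypothetical, the retry interval is a fixed second rather than a random delay, and the number of attempts is bounded.

```shell
# Hypothetical lock helper: try to create the lock file atomically,
# retrying up to $2 times (default 10).
take_lock_sketch() {
    local lockfile tries
    lockfile="$1"
    tries="${2:-10}"
    while [ "$tries" -gt 0 ]; do
        # With noclobber set, the redirection fails if the file exists,
        # so only one process can create the lock.
        if ( set -o noclobber; echo "$$" > "$lockfile" ) 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        # Wait before the next attempt, unless this was the last one.
        [ "$tries" -gt 0 ] && sleep 1
    done
    return 1
}

# Hypothetical counterpart: drop the lock.
release_lock_sketch() {
    rm -f "$1"
}

lock="${TMPDIR:-/tmp}/foobar-sketch.$$.lock"
if take_lock_sketch "$lock" 3; then
    echo "lock acquired"
    release_lock_sketch "$lock"
fi
```

A production lock also has to deal with stale lock files left behind by crashed processes, which this sketch ignores.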

7.5 Testing for numerical values

For parameter validation, the ocf_is_decimal function tests whether a value is a number. For example:

foobar_validate_all() {
    if ! ocf_is_decimal $OCF_RESKEY_eggs; then
        ocf_log err "eggs is not numeric!"
        exit $OCF_ERR_CONFIGURED
    fi
    ...
}
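What such a numeric test boils down to can be shown with a shell pattern match. is_decimal_sketch is a hypothetical stand-in written for this illustration; it accepts non-negative integers only.

```shell
# Hypothetical stand-in for a decimal test: succeed only if the
# argument is a non-empty string made up of digits.
is_decimal_sketch() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;  # empty, or contains a non-digit
        *)           return 0 ;;
    esac
}

for value in 42 4x2 ""; do
    if is_decimal_sketch "$value"; then
        echo "'$value' is numeric"
    else
        echo "'$value' is not numeric"
    fi
done
```

Under this definition, negative numbers and fractions such as -1 or 1.5 are rejected, which is usually what a count-like parameter needs.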

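7.6 Testing for boolean values

When a resource agent defines a boolean parameter, users may set it to 0/1, true/false, or on/off. Testing for all of these by hand is tedious, so the agent should use the ocf_is_true convenience function: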

if ocf_is_true $OCF_RESKEY_superfrobnicate; then
    ocf_run "frobnicate --super"
fi

Note: when ocf_is_true is called on an empty or unset variable, it always returns exit code 1, which is equivalent to false.
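The following sketch shows the kind of matching involved, including the behavior for empty values. is_true_sketch is a hypothetical stand-in, and the exact set of spellings it accepts is an assumption rather than the library's list.

```shell
# Hypothetical stand-in for a boolean test: succeed for the usual
# "true" spellings, fail for everything else (including empty input).
is_true_sketch() {
    case "$1" in
        1|true|TRUE|True|on|ON|yes|YES) return 0 ;;
        *)                              return 1 ;;
    esac
}

unset OCF_RESKEY_superfrobnicate
for value in true on 1 false off 0 "$OCF_RESKEY_superfrobnicate"; do
    if is_true_sketch "$value"; then
        echo "'$value' counts as true"
    else
        echo "'$value' counts as false"
    fi
done
```

Since an unset parameter counts as false here, an agent that wants a feature enabled by default has to apply its default value before running the test.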

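7.7 Pseudo resources: ha_pseudo_resource

Pseudo resources are resource agents that do not start or stop anything resembling a real running process. They merely perform a single action and keep track of whether that action has been performed. The portblock resource agent is one example.

Agents for pseudo resources can use ha_pseudo_resource, which uses a tracking file to record the status of the resource. If foobar were an agent for a pseudo resource, its start action could look like this: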

foobar_start() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # if resource is already running, bail out early
    if foobar_monitor; then
        ocf_log info "Resource is already running"
        return $OCF_SUCCESS
    fi

    # start the pseudo resource
    ha_pseudo_resource ${OCF_RESOURCE_INSTANCE} start

    # After the resource has been started, check whether it started up
    # correctly. If the resource starts asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # start up within the defined timeout, the cluster manager will
    # consider the start action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not started yet, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

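Tuesday, September 03, 2013

OCF Developer's Guide, Chapter 6

6 Script variables

This chapter lists the environment variables that resource agents use most often, mainly for convenience. For further variables, see Section 2.1, Environment variables, and Chapter 3, Return codes.

6.1 $OCF_ROOT

The root of the OCF resource agent hierarchy. The resource agent must never change it. Usually /usr/lib/ocf.

6.2 $OCF_FUNCTIONS_DIR

The directory that holds the resource agent shell function library, ocf-shellfuncs. It is normally defined in terms of $OCF_ROOT and must not be changed by the agent. When testing a resource agent, however, it may be overridden from the command line.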

6.3 $OCF_RESOURCE_INSTANCE

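The resource instance name. For primitive resources (neither cloned nor stateful), this is simply the resource name. For clones and stateful resources, it is the primitive name, followed by a colon and the clone instance number.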
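Splitting such an instance name into its two parts takes only shell parameter expansion. The helper names below are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: the primitive name is everything before the
# first colon (or the whole string if there is no colon).
instance_base_name() {
    echo "${1%%:*}"
}

# Hypothetical helper: the clone instance number is whatever follows
# the last colon; primitives have none, so print an empty line.
instance_clone_number() {
    case "$1" in
        *:*) echo "${1##*:}" ;;
        *)   echo "" ;;
    esac
}

OCF_RESOURCE_INSTANCE="foobar:2"
echo "base:  $(instance_base_name "$OCF_RESOURCE_INSTANCE")"
echo "clone: $(instance_clone_number "$OCF_RESOURCE_INSTANCE")"
```

For foobar:2 this prints foobar as the base name and 2 as the clone number.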

6.4 $__OCF_ACTION

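The currently invoked action. This is the first command-line argument that the cluster manager passes when it invokes the resource agent.

6.5 $__OCF_SCRIPT_NAME

The name of the resource agent, that is, the agent's file name without any leading directory names.

6.6 $HA_RSCTMP

A temporary directory for use by resource agents. The system start-up sequence (on any LSB-compliant Linux distribution) guarantees that this directory is emptied on boot, so it never contains stale data after a node restarts.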

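OCF Developer's Guide, Chapter 5

5 Resource agent actions

Each action is normally implemented in a separate function or method of the resource agent. By convention, these are named <agent>_<action>, so the function implementing the start action in foobar is named foobar_start().

As a general rule, whenever the resource agent hits an unrecoverable error, it may exit immediately, throw an exception, or otherwise stop executing. Typical cases are configuration errors, missing binaries, and permission problems. There is no need to pass such errors up the call stack.

It is the cluster manager's responsibility to start the appropriate recovery action based on the user's configuration. The resource agent should not try to guess at that configuration.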
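The <agent>_<action> naming convention pairs naturally with a small dispatcher at the end of the agent. The sketch below uses stub action functions and a hypothetical foobar_dispatch function; a real agent would dispatch on $__OCF_ACTION after sourcing the shell function library.

```shell
# Stub action functions, named <agent>_<action>. A real agent does the
# actual work here.
foobar_start()   { echo "starting"; }
foobar_stop()    { echo "stopping"; }
foobar_monitor() { echo "monitoring"; }

# Hypothetical dispatcher: route the requested action to its function.
foobar_dispatch() {
    case "$1" in
        start)   foobar_start ;;
        stop)    foobar_stop ;;
        monitor) foobar_monitor ;;
        *)
            echo "action '$1' is not implemented" >&2
            return 3   # numeric value of $OCF_ERR_UNIMPLEMENTED
            ;;
    esac
}

# The cluster manager passes the action as the first argument.
foobar_dispatch start
```

Keeping the dispatcher this thin means that each action function can be read, and tested, on its own.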

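5.1 start action

When invoked with the start action, the resource agent must start the resource unless it is already running. This means the agent has to validate the resource configuration, query the resource's state, and start the resource only if it is not running. A common approach is to call the validate_all and monitor functions first, as in this example: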

 
foobar_start() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # if resource is already running, bail out early
    if foobar_monitor; then
        ocf_log info "Resource is already running"
        return $OCF_SUCCESS
    fi

    # actually start up the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ...

    # After the resource has been started, check whether it started up
    # correctly. If the resource starts asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # start up within the defined timeout, the cluster manager will
    # consider the start action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not started yet, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

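5.2 stop action

When invoked with the stop action, the resource agent must stop the resource if it is running. This means the agent has to validate the resource configuration, query the resource's state, and stop the resource only if it is currently running. A common approach is to call the validate_all and monitor functions first. Keep in mind that stop is a forced operation: the agent must do everything in its power to shut the resource down, short of rebooting or powering off the node. Consider the following example: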

foobar_stop() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_SUCCESS")
            # Currently running. Normal, expected behavior.
            ocf_log debug "Resource is currently running"
            ;;
        "$OCF_RUNNING_MASTER")
            # Running as a Master. Need to demote before stopping.
            ocf_log info "Resource is currently running as Master"
            foobar_demote || \
                ocf_log warn "Demote failed, trying to stop anyway"
            ;;
        "$OCF_NOT_RUNNING")
            # Currently not running. Nothing to do.
            ocf_log info "Resource is already stopped"
            return $OCF_SUCCESS
            ;;
    esac

    # actually shut down the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ...

    # After the resource has been stopped, check whether it shut down
    # correctly. If the resource stops asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # shut down within the defined timeout, the cluster manager will
    # consider the stop action failed
    while foobar_monitor; do
        ocf_log debug "Resource has not stopped yet, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS

}

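Note: the return code for a successful stop action is $OCF_SUCCESS, not $OCF_NOT_RUNNING.

Important: a failed stop is potentially dangerous, and the cluster manager will almost always try to resolve it by fencing, that is, by forcibly evicting the node from the cluster. Fencing ultimately protects data, but it does disrupt applications and their users. A resource agent should therefore return an error from stop only after every reasonable means of shutting down the resource has been exhausted.

5.3 monitor action

The monitor action queries the current state of a resource. It must distinguish three states:

the resource is running (return $OCF_SUCCESS)
the resource has stopped gracefully (return $OCF_NOT_RUNNING)
the resource has run into a problem and must be considered failed (return the $OCF_ERR_ code that describes the problem best)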

   
foobar_monitor() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    ocf_run frobnicate --test

    # This example assumes the following exit code convention
    # for frobnicate:
    # 0: running, and fully caught up with master
    # 1: gracefully stopped
    # any other: error
    case "$?" in
        0)
            rc=$OCF_SUCCESS
            ocf_log debug "Resource is running"
            ;;
        1)
            rc=$OCF_NOT_RUNNING
            ocf_log debug "Resource is not running"
            ;;
        *)
            ocf_log err "Resource has failed"
            exit $OCF_ERR_GENERIC
    esac

    return $rc
}

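Stateful (master/slave) resource agents may use a more elaborate monitoring scheme that gives the cluster manager hints about which instance is best suited to become Master. Section 9.4, Specifying a master preference, explains the details.

Note: the cluster manager uses probes to test whether a resource is already running, and a probe invokes the monitor action. Normally, monitor behaves exactly the same during a probe as during a regular monitor run. For resources that need special treatment during probes, the ocf_is_probe function is provided.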
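How an agent might tell a probe from a recurring monitor can be sketched as below. The sketch rests on an assumption that this chapter does not state: that the cluster manager issues a probe as a monitor action with an interval of 0, exposed through $OCF_RESKEY_CRM_meta_interval. is_probe_sketch is a hypothetical stand-in, not the library function.

```shell
# Hypothetical stand-in for a probe test. Assumes that a probe is a
# monitor action whose interval is 0.
is_probe_sketch() {
    [ "$__OCF_ACTION" = "monitor" ] && [ "$OCF_RESKEY_CRM_meta_interval" = "0" ]
}

# Simulate the environment of a probe ...
__OCF_ACTION="monitor"
OCF_RESKEY_CRM_meta_interval="0"
is_probe_sketch && echo "this is a probe"

# ... and of a recurring monitor (interval value chosen arbitrarily).
OCF_RESKEY_CRM_meta_interval="10000"
is_probe_sketch || echo "this is a regular monitor"
```

In a real agent, use ocf_is_probe from the shell function library rather than reimplementing the test.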

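5.4 validate-all action

The validate-all action tests the resource agent's configuration and working environment. It exits with one of the following return codes:

$OCF_SUCCESS ---- all is well, the configuration is valid and usable
$OCF_ERR_CONFIGURED ---- the resource is misconfigured
$OCF_ERR_INSTALLED ---- the resource may be configured correctly, but essential components are missing on the node where validate-all runs
$OCF_ERR_PERM ---- the resource is configured correctly and no components are missing, but there is a permission problem (for example, a required file cannot be created)

validate-all is usually wrapped in a function that is called not only when the action is invoked explicitly, but also from other functions. Developers should therefore keep in mind that this function may also run during the start, stop, and monitor actions.

Probes pose an additional challenge for validation. During a probe (when the cluster manager may expect the resource not to be running on the node being probed), some required components may legitimately be unavailable on that node. Shared data on a storage device, for example, is expected to be unreadable during a probe. The validate-all function may therefore need to treat probes specially, using the ocf_is_probe function.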

foobar_validate_all() {
    # Test for configuration errors first
    if ! ocf_is_decimal $OCF_RESKEY_eggs; then
       ocf_log err "eggs is not numeric!"
       exit $OCF_ERR_CONFIGURED
    fi

    # Test for required binaries
    check_binary frobnicate

    # Check for data directory (this may be on shared storage, so
    # disable this test during probes)
    if ! ocf_is_probe; then
       if ! [ -d $OCF_RESKEY_datadir ]; then
          ocf_log err "$OCF_RESKEY_datadir does not exist or is not a directory!"
          exit $OCF_ERR_INSTALLED
       fi
    fi

    return $OCF_SUCCESS
}

5.5 meta-data action

The meta-data action dumps the resource agent metadata to standard output. The output must follow the metadata format described in Section 2.4.

foobar_meta_data() {
    cat <<EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="foobar" version="0.1">
  <version>0.1</version>
  <longdesc lang="en">
...
EOF
}
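Since the example above is abridged, a complete metadata function may help as a reference point. The following is a sketch only: the parameter names (eggs, superfrobnicate, datadir) are borrowed from other examples in this guide, while all descriptions, types, defaults, and timeouts are made-up illustrative values.

```shell
# Sketch of a complete meta-data function. All descriptive text,
# defaults and timeouts are illustrative assumptions.
foobar_meta_data_sketch() {
    cat <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="foobar" version="0.1">
  <version>0.1</version>
  <longdesc lang="en">Example agent for a hypothetical foobar service.</longdesc>
  <shortdesc lang="en">Manages foobar</shortdesc>
  <parameters>
    <parameter name="eggs" unique="0" required="1">
      <longdesc lang="en">Number of eggs to frobnicate.</longdesc>
      <shortdesc lang="en">Egg count</shortdesc>
      <content type="integer"/>
    </parameter>
    <parameter name="superfrobnicate" unique="0" required="0">
      <longdesc lang="en">Enable super frobnication.</longdesc>
      <shortdesc lang="en">Super frobnication</shortdesc>
      <content type="boolean" default="0"/>
    </parameter>
    <parameter name="datadir" unique="0" required="1">
      <longdesc lang="en">Directory holding the foobar data.</longdesc>
      <shortdesc lang="en">Data directory</shortdesc>
      <content type="string"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="20"/>
    <action name="stop" timeout="20"/>
    <action name="monitor" timeout="20" interval="10" depth="0"/>
    <action name="meta-data" timeout="5"/>
    <action name="validate-all" timeout="20"/>
  </actions>
</resource-agent>
EOF
}

foobar_meta_data_sketch
```

Note that the only parameter with a default in this sketch is the one not marked as required, in line with the convention from Section 8.2.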

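5.6 promote action

The promote action is optional. It needs to be supported only by stateful resource agents, that is, agents that distinguish two roles: Master and Slave. The Slave role is functionally identical to the Started state of a stateless resource agent. So while a regular (stateless) resource agent only needs to implement start and stop, a stateful resource agent must also support the transition from the Started (Slave) role to the Master role.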

foobar_promote() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # test the resource's current state
    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_SUCCESS")
            # Running as slave. Normal, expected behavior.
            ocf_log debug "Resource is currently running as Slave"
            ;;
        "$OCF_RUNNING_MASTER")
            # Already a master. Unexpected, but not a problem.
            ocf_log info "Resource is already running as Master"
            return $OCF_SUCCESS
            ;;
        "$OCF_NOT_RUNNING")
            # Currently not running. Need to start before promoting.
            ocf_log info "Resource is currently not running"
            foobar_start
            ;;
        *)
            # Failed resource. Let the cluster manager recover.
            ocf_log err "Unexpected error, cannot promote"
            exit $rc
            ;;
    esac

    # actually promote the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --master-mode || exit $OCF_ERR_GENERIC

    # After the resource has been promoted, check whether the
    # promotion worked. If the resource promotion is asynchronous, the
    # agent may spin on the monitor function here -- if the resource
    # does not assume the Master role within the defined timeout, the
    # cluster manager will consider the promote action failed.
    while true; do
        foobar_monitor
        if [ $? -eq $OCF_RUNNING_MASTER ]; then
            ocf_log debug "Resource promoted"
            break
        else
            ocf_log debug "Resource still awaiting promotion"
            sleep 1
        fi
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

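5.7 demote action

The demote action is optional. It needs to be supported only by stateful resource agents, that is, agents that distinguish two roles: Master and Slave. The Slave role is functionally identical to the Started state of a stateless resource agent. So while a regular (stateless) resource agent only needs to implement start and stop, a stateful resource agent must also support the transition from the Master role to the Started (Slave) role.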

foobar_demote() {
    local rc

    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # test the resource's current state
    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_RUNNING_MASTER")
            # Running as master. Normal, expected behavior.
            ocf_log debug "Resource is currently running as Master"
            ;;
        "$OCF_SUCCESS")
            # Already running as slave. Nothing to do.
            ocf_log debug "Resource is currently running as Slave"
            return $OCF_SUCCESS
            ;;
        "$OCF_NOT_RUNNING")
            # Currently not running. Getting a demote action
            # in this state is unexpected. Exit with an error
            # and let the cluster manager recover.
            ocf_log err "Resource is currently not running"
            exit $OCF_ERR_GENERIC
            ;;
        *)
            # Failed resource. Let the cluster manager recover.
            ocf_log err "Unexpected error, cannot demote"
            exit $rc
            ;;
    esac

    # actually demote the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --unset-master-mode || exit $OCF_ERR_GENERIC

    # After the resource has been demoted, check whether the
    # demotion worked. If the resource demotion is asynchronous, the
    # agent may spin on the monitor function here -- if the resource
    # does not assume the Slave role within the defined timeout, the
    # cluster manager will consider the demote action failed.
    while true; do
        foobar_monitor
        if [ $? -eq $OCF_RUNNING_MASTER ]; then
            ocf_log debug "Resource still awaiting demotion"
            sleep 1
        else
            ocf_log debug "Resource demoted"
            break
        fi
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

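5.8 migrate_to action

The migrate_to action serves one of two purposes:

Initiate a native push-type migration of the resource. In other words, instruct the resource to move from the node it currently runs on to a given target node. The resource agent learns the target node from the $OCF_RESKEY_CRM_meta_migrate_target environment variable.
Freeze the resource in a freeze/thaw (also known as suspend/resume) migration. In this mode the resource does not need to know its destination.

The example below illustrates a push-type migration: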

foobar_migrate_to() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # if resource is not running, bail out early
    if ! foobar_monitor; then
        ocf_log err "Resource is not running"
        exit $OCF_ERR_GENERIC
    fi

    # actually migrate the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --migrate \
                       --dest=$OCF_RESKEY_CRM_meta_migrate_target \
                       || exit $OCF_ERR_GENERIC
    ...

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

In contrast, a freeze/thaw type migration may implement its freeze operation like this:

 
foobar_migrate_to() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # if resource is not running, bail out early
    if ! foobar_monitor; then
        ocf_log err "Resource is not running"
        exit $OCF_ERR_GENERIC
    fi

    # actually freeze the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --freeze || exit $OCF_ERR_GENERIC
    ...

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

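5.9 migrate_from action

The migrate_from action serves one of two purposes:

Complete a native push-type migration of the resource. In other words, check whether the migration has finished correctly and the resource is running on the local node. The resource agent learns the source node from the $OCF_RESKEY_CRM_meta_migrate_source environment variable.
Thaw the resource in a freeze/thaw (also known as suspend/resume) migration. In this mode the resource does not need to know where it came from.

The example below illustrates a push-type migration: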

 
foobar_migrate_from() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # After the resource has been migrated, check whether it resumed
    # correctly. If the resource starts asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # run within the defined timeout, the cluster manager will
    # consider the migrate_from action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not yet migrated, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

In contrast, a freeze/thaw type migration may implement its thaw operation like this:

 
foobar_migrate_from() {
    # exit immediately if configuration is not valid
    foobar_validate_all || exit $?

    # actually thaw the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --thaw || exit $OCF_ERR_GENERIC

    # After the resource has been migrated, check whether it resumed
    # correctly. If the resource starts asynchronously, the agent may
    # spin on the monitor function here -- if the resource does not
    # run within the defined timeout, the cluster manager will
    # consider the migrate_from action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not yet migrated, waiting"
        sleep 1
    done

    # only return $OCF_SUCCESS if _everything_ succeeded as expected
    return $OCF_SUCCESS
}

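5.10 notify action

With notifications, instances of a clone (including master/slave resources, which are an extended kind of clone) can inform each other about their state. When notifications are enabled, every action on any instance of a clone carries a pre and a post notification, and the cluster manager then invokes the notify action on all clone instances. During notify, the following additional environment variables are available:

$OCF_RESKEY_CRM_meta_notify_type ---- the notification type (pre or post)
$OCF_RESKEY_CRM_meta_notify_operation ---- the operation (action) that the notification is about (start, stop, promote, demote, and so on)
$OCF_RESKEY_CRM_meta_notify_start_uname ---- the name of the node on which the resource is being started (start notifications only)
$OCF_RESKEY_CRM_meta_notify_stop_uname ---- the name of the node on which the resource is being stopped (stop notifications only)
$OCF_RESKEY_CRM_meta_notify_master_uname ---- the name of the node on which the resource currently runs in the Master role
$OCF_RESKEY_CRM_meta_notify_promote_uname ---- the name of the node on which the resource is being promoted to the Master role (promote notifications only)
$OCF_RESKEY_CRM_meta_notify_demote_uname ---- the name of the node on which the resource is being demoted to the Slave role (demote notifications only)

Notifications are particularly useful for master/slave resources that follow a publish/subscribe scheme, where the master is the publisher and the slaves are subscribers. Since the master is only available in that role once a promotion has taken place, the slaves can use the pre-promote notification to configure themselves to point at the right publisher.

Likewise, subscribers will want to unsubscribe once the master role is given up. The post-demote notification serves that purpose.

The following example illustrates the concept: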

foobar_notify() {
    local type_op
    type_op="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}"

    ocf_log debug "Received $type_op notification."
    case "$type_op" in
        'pre-promote')
            ocf_run frobnicate --slave-mode \
                               --master=$OCF_RESKEY_CRM_meta_notify_promote_uname \
                               || exit $OCF_ERR_GENERIC
            ;;
        'post-demote')
            ocf_run frobnicate --unset-slave-mode || exit $OCF_ERR_GENERIC
            ;;
    esac

    return $OCF_SUCCESS
}

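Note: a master/slave resource agent may support a multi-master configuration, in which more than one master can exist at a time. In that case the $OCF_RESKEY_CRM_meta_notify_*_uname variables each contain a space-separated list of host names rather than the single host name assumed in the example above, and the resource agent must be prepared to process such a list.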
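Processing such a list is a matter of letting the shell split the variable on whitespace. The sketch below uses a hypothetical helper and made-up node names (alice, bob); what to do for each node is left as an echo.

```shell
# Hypothetical helper: act on every node named in the promote list.
list_promoted_nodes() {
    local node
    # Left unquoted on purpose: word splitting breaks the
    # space-separated list into one node name per iteration.
    for node in $OCF_RESKEY_CRM_meta_notify_promote_uname; do
        echo "would subscribe to master on $node"
    done
}

# Simulate a multi-master pre-promote notification.
OCF_RESKEY_CRM_meta_notify_promote_uname="alice bob"
list_promoted_nodes
```

With a single host name in the variable, the loop simply runs once, so the same code covers both the single-master and the multi-master case.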
