Pausing Charms with subordinate hacluster without sending false alerts

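The overall goal is to raise "warning" alerts instead of "critical" alerts. This tells human operators that not all services are fully healthy, while lowering the severity of alerts caused by an ongoing operation. NRPE checks are reconfigured when the services under maintenance return to normal (resume).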

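The following logic is applied when pausing or resuming a unit:

Pausing a principal unit pauses the subordinate hacluster;
Resuming a principal unit resumes the subordinate hacluster;
Pausing a hacluster unit pauses the principal unit;
Resuming a hacluster unit resumes the principal unit.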
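
The four rules above amount to keeping both sides of a principal/subordinate pair in the same state. A minimal sketch of that invariant follows; `UnitPair` and `apply_action` are hypothetical names used only for illustration and are not part of any charm.

```python
from dataclasses import dataclass


@dataclass
class UnitPair:
    """A principal unit and its hacluster subordinate (hypothetical model)."""
    principal_paused: bool = False
    hacluster_paused: bool = False


def apply_action(pair: UnitPair, action: str, target: str) -> UnitPair:
    """Apply 'pause' or 'resume' to the 'principal' or 'hacluster' side.

    Whichever side receives the action, the other side follows, so the
    pair never ends up half paused.
    """
    if action not in ("pause", "resume"):
        raise ValueError("unknown action: %s" % action)
    if target not in ("principal", "hacluster"):
        raise ValueError("unknown target: %s" % target)
    paused = action == "pause"
    return UnitPair(principal_paused=paused, hacluster_paused=paused)
```

In the actual proposal the propagation happens through the ha relation rather than through a shared object; the sketch only captures the intended end state.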

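Problem Description

We need to stop sending false alerts when the hacluster subordinate of an OpenStack charm unit is paused, or when the principal unit is paused for maintenance. This should help operators receive more meaningful alerts.

Several charms that use hacluster and NRPE can benefit from this: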

charm-ceilometer
charm-ceph-radosgw
charm-designate
charm-keystone
charm-neutron-api
charm-nova-cloud-controller
charm-openstack-dashboard
charm-cinder
charm-glance
charm-heat
charm-swift-proxy

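Pausing a principal unit

For example, suppose 3 keystone units are deployed (keystone/0, keystone/1 and keystone/2) and keystone/0 is paused:

1) haproxy_servers on the other units (keystone/1 and keystone/2) will alert, because the apache2 service on keystone/0 is stopped;

2) haproxy, apache2.service and memcached.service on keystone/0 will also alert;

3) corosync and pacemaker may place the VIP on that same unit, in which case the service fails because haproxy is disabled. The hacluster subordinate unit should therefore be paused as well.

Note: the services affected by pausing a principal unit may vary with the principal charm.

Pausing a hacluster unit

Pausing hacluster puts the cluster node (e.g. keystone) into standby mode. A standby node stops its resources (hacluster, apache2), which triggers false alerts. To solve this, the hacluster units should inform the keystone units that they are paused. One way to do this is through the ha relation.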

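Proposed Change

Pausing a principal unit

The pause action on a principal unit should share the event with its peers so that they modify their behaviour until the resume action is triggered. It should also share the state (paused/resumed) with the subordinate unit so that both stay in the same state.

File actions.py in the principal unit: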

def pause(args):
    pause_unit_helper(register_configs())

    # Logic added to share the event with peers
    inform_peers_if_ready(check_api_unit_ready)
    if is_nrpe_joined():
        update_nrpe_config()

    # logic added to inform hacluster subordinate unit has been paused
    relid = relation_ids('ha')
    for r_id in relid:
        relation_set(relation_id=r_id, paused=True)

def resume(args):
    resume_unit_helper(register_configs())

    # Logic added to share the event with peers
    inform_peers_if_ready(check_api_unit_ready)
    if is_nrpe_joined():
        update_nrpe_config()

    # logic added to inform hacluster subordinate unit has been resumed
    relid = relation_ids('ha')
    for r_id in relid:
        relation_set(relation_id=r_id, paused=False)

After a principal unit is paused, it changes unit-state-{unit_name} to NOTREADY. For example:

juju show-unit keystone/0 --endpoint cluster
keystone/0:
  workload-version: 17.0.0
  machine: "1"
  opened-ports:
  - 5000/tcp
  public-address: 10.5.2.64
  charm: cs:~openstack-charmers-next/keystone-562
  leader: true
  relation-info:
  - endpoint: cluster
    related-endpoint: cluster
    application-data: {}
    local-unit:
      in-scope: true
      data:
        admin-address: 10.5.2.64
        egress-subnets: 10.5.2.64/32
        ingress-address: 10.5.2.64
        internal-address: 10.5.2.64
        private-address: 10.5.2.64
        public-address: 10.5.2.64
        unit-state-keystone-0: NOTREADY

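Note: the unit-state-{unit_name} field is already implemented. The proposal is only to reuse it, setting its value to NOTREADY when a unit is paused and back to READY when it is resumed.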
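
As a rough illustration of how the field name and value relate to a unit, the sketch below builds the peer-relation entry shown in the juju output above. The helper names `unit_state_key` and `unit_state_data` are hypothetical; only the key format and the READY/NOTREADY values come from this spec.

```python
UNIT_READY = "READY"
UNIT_NOTREADY = "NOTREADY"


def unit_state_key(unit_name: str) -> str:
    """Build the peer-relation key, e.g. 'keystone/0' -> 'unit-state-keystone-0'."""
    return "unit-state-{}".format(unit_name.replace("/", "-"))


def unit_state_data(unit_name: str, paused: bool) -> dict:
    """Return the key/value pair a unit would publish to its peers."""
    state = UNIT_NOTREADY if paused else UNIT_READY
    return {unit_state_key(unit_name): state}
```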

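Once every unit knows which units are paused, the check_haproxy.sh script can be changed to accept a flag that downgrades alerts for paused keystone units to warnings. The Bash script currently cannot receive flags.

check_haproxy.sh can be rewritten from Bash to Python so that it accepts a flag to warn that specific hostnames are under maintenance (e.g. check_haproxy.py --warning keystone-0).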
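
One possible shape for such a script is sketched below. It is an assumption-laden illustration, not the proposed implementation: the function names are hypothetical, the stats CSV is passed in as text rather than fetched from HAProxy, and any server whose status does not start with UP is treated as down.

```python
import argparse
import csv
import io

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def evaluate(stats_csv, warning_hosts):
    """Return (exit_code, message) for an HAProxy stats CSV dump."""
    # HAProxy prefixes the CSV header line with "# "; drop it for DictReader.
    reader = csv.DictReader(io.StringIO(stats_csv.lstrip("# ")))
    warned, failed = [], []
    for row in reader:
        # Skip the aggregate rows; only individual servers are checked.
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue
        if (row.get("status") or "").startswith("UP"):
            continue
        # A down server is only a warning if it is flagged as in maintenance.
        if row["svname"] in warning_hosts:
            warned.append(row["svname"])
        else:
            failed.append(row["svname"])
    if failed:
        return CRITICAL, "CRITICAL: down: " + ",".join(failed)
    if warned:
        return WARNING, "WARNING: under maintenance: " + ",".join(warned)
    return OK, "OK: all servers up"


def parse_args(argv):
    """Parse the command line; --warning takes comma-separated server names."""
    parser = argparse.ArgumentParser(description="Check HAProxy servers")
    parser.add_argument("--warning", default="",
                        help="comma-separated server names under maintenance")
    return parser.parse_args(argv)
```

A real script would additionally retrieve the stats from HAProxy, print the message and exit with the returned code so that Nagios can pick up the status.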

The nrpe.py file in charmhelpers/contrib/charmsupport should be changed to first check whether any unit in the cluster is paused and then add the warning flag when needed:

def add_haproxy_checks(nrpe, unit_name):
    """
    Add checks for each service in list

    :param NRPE nrpe: NRPE object to add check to
    :param str unit_name: Unit name to use in check description
    """
    cmd = "check_haproxy.py"

    peers_states = get_peers_unit_state()
    units_not_ready = [
        unit.replace('/', '-')
        for unit, state in peers_states.items()
        if state == UNIT_NOTREADY
    ]

    if is_unit_paused_set():
        units_not_ready.append(local_unit().replace('/', '-'))

    if units_not_ready:
        cmd += " --warning {}".format(','.join(units_not_ready))

    nrpe.add_check(
        shortname='haproxy_servers',
        description='Check HAProxy {%s}' % unit_name,
        check_cmd=cmd)
    nrpe.add_check(
        shortname='haproxy_queue',
        description='Check HAProxy queue depth {%s}' % unit_name,
        check_cmd='check_haproxy_queue_depth.sh')

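When a principal unit changes state (e.g. READY to NOTREADY), the NRPE files on the other principal units in the cluster must be rewritten; otherwise they cannot warn that a unit is under maintenance.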
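
To see why the rewrite is needed, the sketch below isolates the command-building step as a pure function and shows that its output changes with the peer state. `haproxy_check_cmd` is a hypothetical helper; the peer-state mapping is assumed to be keyed by unit name, as the list comprehension in add_haproxy_checks suggests.

```python
UNIT_NOTREADY = "NOTREADY"


def haproxy_check_cmd(peer_states, local_unit, local_paused):
    """Build the haproxy_servers check command for the current cluster state."""
    # Peers that reported NOTREADY are under maintenance.
    units_not_ready = sorted(
        unit.replace("/", "-")
        for unit, state in peer_states.items()
        if state == UNIT_NOTREADY
    )
    # The local unit is not in the peer data, so add it explicitly.
    if local_paused:
        units_not_ready.append(local_unit.replace("/", "-"))
    cmd = "check_haproxy.py"
    if units_not_ready:
        cmd += " --warning {}".format(",".join(units_not_ready))
    return cmd
```

The command stored in the NRPE file is fixed at write time, so a peer that never rewrites its file keeps running the check without the --warning flag.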

File responsible for hooks in a classic charm:

@hooks.hook('cluster-relation-changed')
@restart_on_change(restart_map(), stopstart=True)
def cluster_changed():
    # logic added to update nrpe_config in all principal units when
    # a status is changed
    update_nrpe_config()

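Note: in reactive charms the handlers may look slightly different, but the basic idea is to call update_nrpe_config every time the configuration in the cluster changes. This prevents false alerts from the other units in the cluster.

Services of the principal unit

Removing the .cfg files from /etc/nagios/nrpe.d when a unit is paused stops critical alerts from being sent for those services. The downside of this approach is that Nagios shows no user-friendly message saying that a specific service (apache2, memcached, etc.) is under maintenance; on the other hand, it is simpler to implement.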
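
The effect on the configuration directory can be sketched as follows. This is only an illustration: `remove_check_files` is a hypothetical helper, and the check_<shortname>.cfg file naming is an assumption about how the check files are laid out, not something this spec states.

```python
import os
import tempfile


def remove_check_files(confdir, shortnames):
    """Delete the check files for the given services and return those removed.

    Assumes one file per check named check_<shortname>.cfg (an assumption).
    """
    removed = []
    for name in shortnames:
        path = os.path.join(confdir, "check_{}.cfg".format(name))
        if os.path.exists(path):
            os.remove(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    # Demonstrate on a temporary directory instead of /etc/nagios/nrpe.d.
    with tempfile.TemporaryDirectory() as confdir:
        for name in ("apache2", "memcached", "haproxy_servers"):
            open(os.path.join(confdir, "check_%s.cfg" % name), "w").close()
        print(remove_check_files(confdir, ["apache2", "memcached"]))
        print(os.listdir(confdir))
```

The haproxy checks are left in place here because the proposal handles them through the warning flag rather than by removal.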

File responsible for hooks in a classic charm:

@hooks.hook('nrpe-external-master-relation-joined',
            'nrpe-external-master-relation-changed')
def update_nrpe_config():
    # logic before change
    # ...

    nrpe_setup = nrpe.NRPE(hostname=hostname)
    nrpe.copy_nrpe_checks()

    # added logic to remove services
    if is_unit_paused_set():
        nrpe.remove_init_service_checks(
            nrpe_setup,
            _services,
            current_unit
        )

    else:
        nrpe.add_init_service_checks(
            nrpe_setup,
            _services,
            current_unit
        )

    # end of added logic

    nrpe.add_haproxy_checks(nrpe_setup, current_unit)
    nrpe_setup.write()

The new logic to remove those services is shown below.

File charmhelpers/contrib/charmsupport/nrpe.py:

# added logic to remove apache2, memcached, etc.
def remove_init_service_checks(nrpe, services, unit_name):
    for svc in services:
        if host.init_is_systemd(service_name=svc):
            nrpe.remove_check(
                shortname=svc,
                description='process check {%s}' % unit_name,
                check_cmd='check_systemd.py %s' % svc
            )

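The service status disappears from Nagios after a few minutes. When the resume action is used, the services initially come back as PENDING and are checked again after a few minutes.

Pausing a hacluster unit

File actions.py in charm-hacluster: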

def pause(args):
    """Pause the hacluster services.
    @raises Exception should the service fail to stop.
    """
    pause_unit()
    # logic added to inform keystone that unit has been paused
    relid = relation_ids('ha')
    for r_id in relid:
        relation_set(relation_id=r_id, paused=True)


def resume(args):
    """Resume the hacluster services.
    @raises Exception should the service fail to start."""
    resume_unit()
    # logic added to inform keystone that unit has been resumed
    relid = relation_ids('ha')
    for r_id in relid:
        relation_set(relation_id=r_id, paused=False)

Pausing hacluster shares a new variable named paused, which the principal unit can use.

File responsible for hooks in a classic charm:

@hooks.hook('ha-relation-changed')
@restart_on_change(restart_map(), restart_functions=restart_function_map())
def ha_changed():

    # Added logic to pause keystone unit when hacluster is paused
    for rid in relation_ids('ha'):
        for unit in related_units(rid):
            paused = relation_get('paused', rid=rid, unit=unit)
            clustered = relation_get('clustered', rid=rid, unit=unit)
            if clustered and is_db_ready():
                if paused == 'True':
                    pause_unit_helper(register_configs())

                elif paused == 'False':
                    resume_unit_helper(register_configs())

                update_nrpe_config()
                inform_peers_if_ready(check_api_unit_ready)
                # inform subordinate unit that is paused or resumed
                relation_set(relation_id=rid, paused=is_unit_paused_set())

Informing the peers and updating the NRPE configuration is enough to trigger the logic that removes the service checks.
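
The hook compares paused against the strings 'True' and 'False', which leaves a third case: the key has not been set yet. The small sketch below makes the three outcomes explicit; `paused_action` is a hypothetical helper and not part of the charm.

```python
def paused_action(value):
    """Map the 'paused' relation value to 'pause', 'resume' or None.

    The hook compares against the strings 'True' and 'False'; anything
    else, including an unset key (None), means there is nothing to do.
    """
    if value == "True":
        return "pause"
    if value == "False":
        return "resume"
    return None
```

Treating an unset value as a no-op matters when the remote side has not yet published the paused key, since those units should be left untouched.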

When a principal unit is paused, hacluster should be paused as well. This can be done in the ha-relation-changed hook of charm-hacluster:

@hooks.hook('ha-relation-joined',
            'ha-relation-changed',
            'peer-availability-relation-joined',
            'peer-availability-relation-changed',
            'pacemaker-remote-relation-changed')
def ha_relation_changed():
    # Inserted logic
    # pauses if the principal unit is paused
    paused = relation_get('paused')
    if paused == 'True':
        pause_unit()
    elif paused == 'False':
        resume_unit()

    # share the subordinate unit's paused status
    for rel_id in relation_ids('ha'):
        relation_set(
            relation_id=rel_id,
            clustered="yes",
            paused=is_unit_paused_set()
        )

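Alternatives

An alternative for the principal unit's service checks is to change systemd.py in charm-nrpe to accept a -w flag, as proposed for check_haproxy.py.

This removes the need to delete the .cfg files of the principal unit's services, but the add_init_service_checks function must be adapted to accept services with the warning flag.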
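
A minimal sketch of how such a flag could change the result of a service check is shown below. It does not reflect the current systemd.py in charm-nrpe: the function names and the long option --warning are hypothetical, and the service state is passed in as a boolean instead of being queried from systemd.

```python
import argparse

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def service_status(is_active, maintenance):
    """Return the exit code for a service check.

    An inactive service is CRITICAL unless the maintenance flag is set,
    in which case it is downgraded to WARNING.
    """
    if is_active:
        return OK
    return WARNING if maintenance else CRITICAL


def parse_args(argv):
    """Parse '<service> [-w]' (hypothetical command line)."""
    parser = argparse.ArgumentParser(description="Check a systemd service")
    parser.add_argument("service")
    parser.add_argument("-w", "--warning", action="store_true",
                        help="report an inactive service as WARNING")
    return parser.parse_args(argv)
```

With this approach the check stays visible in Nagios while the unit is paused, which addresses the main downside of removing the .cfg files.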

Implementation

Assignee(s)

Primary assignee: gabrielcocenza

Gerrit Topic

Use Gerrit topic "pausing-charms-hacluster-no-false-alerts" for all patches related to this spec.

git-review -t pausing-charms-hacluster-no-false-alerts

Work Items

charmhelpers
- nrpe.py
- check_haproxy.py
charm-ceilometer
charm-ceph-radosgw
charm-designate
charm-keystone
charm-neutron-api
charm-nova-cloud-controller
charm-openstack-dashboard
charm-cinder
charm-glance
charm-heat
charm-swift-proxy
charm-nrpe (alternative)
- systemd.py
charm-hacluster
- actions.py

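Repositories

No new git repository is required.

Documentation

The impact of pausing/resuming a subordinate hacluster and the side effects on OpenStack API charms need to be documented.

Security

No additional security concerns.

Testing

Code changes will be covered by unit and functional tests. The functional tests will use a bundle with keystone, hacluster, nrpe and nagios.

Dependencies

None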
