Handling Reshaped Provider Trees¶

https://blueprints.launchpad.net/nova/+spec/reshape-provider-tree

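Virt drivers need to be able to change the structure of the provider trees they expose. When existing resources are moved, existing allocations need to move along with the inventories. This must be done in a way that avoids races in which a second entity creates or removes allocations against the inventories being moved.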

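Problem description¶

Use Cases¶

- The libvirt driver currently inventories VGPU resources on the compute node provider. In order to exploit provider trees, libvirt needs to create one child provider per physical GPU and move the VGPU inventory from the compute node provider to these GPU child providers. In a live deployment where VGPU resources are already allocated to instances, the allocations need to be moved along with the inventories.
- Drivers wishing to model NUMA must similarly create child providers and move inventory and allocations of several classes (processor, memory, VFs on NUMA-affined NICs, etc.) to those providers.
- A driver is using a custom resource class. That resource class is added to the standard set (under a new, non-CUSTOM_ name). In order to use the standard name, the driver must move inventory and allocations from the old name to the new one.

These are just example cases that may exist now or in the future. What we describe here is a generic pivot system.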

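Proposed change¶

The overall flow is shown below. The parts in red only happen when a reshape is needed. This represents the happy path on compute startup only.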

../../../_images/reshape-provider-tree.svg

Note that, for Fast-Forward Upgrades, the Resource Tracker lane is actually the Offline Upgrade Script.

SchedulerReportClient.get_allocations_for_provider_tree()¶

A new SchedulerReportClient method will be implemented:

def get_allocations_for_provider_tree(self):
    """Retrieve allocation records associated with all providers in the
    provider tree.

    :returns: A dict, keyed by consumer UUID, of allocation records.
    """

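A consumer is not always an instance (it might be a "migration", or in the future something not created by Nova at all), so we cannot simply use the instance list as the consumer list.

We cannot fetch all allocations for the associated sharing providers, because some of those allocations belong to consumers on other hosts.

So we have to discover all the consumers associated with the providers in the local tree: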

for each "local" provider:
    GET /resource_providers/{provider.uuid}/allocations

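We cannot use just those allocations, because we would miss the allocations against sharing providers. So we have to get all the allocations, but only for the consumers discovered above: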

for each consumer in ^:
    GET /allocations/{consumer.uuid}

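Note

We will still miss data if all of a consumer's allocations live on sharing providers. I have not found a good way to close that gap. That scenario will not occur in the near term, so it will be recorded as a limitation in a code comment.

Return a dict, keyed by {consumer.uuid}, of the resulting allocation records. This is the form of the new allocations parameter expected by update_provider_tree() and update_from_provider_tree().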
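The two-pass discovery described above can be sketched as follows. This is a minimal illustration, not the report client's code: `get_json` is a hypothetical callable standing in for a placement GET that returns the decoded JSON body, the function takes the local provider UUIDs explicitly instead of reading them from a ProviderTree, and `FAKE_PLACEMENT` holds made-up responses.

```python
def get_allocations_for_provider_tree(local_provider_uuids, get_json):
    """Collect full allocation records for every consumer of a local provider.

    `get_json` is a hypothetical callable: it takes a placement URL and
    returns the decoded JSON response body.
    """
    # Pass 1: discover the consumers holding allocations on local providers.
    consumers = set()
    for rp_uuid in local_provider_uuids:
        body = get_json('/resource_providers/%s/allocations' % rp_uuid)
        consumers.update(body['allocations'])

    # Pass 2: fetch each consumer's complete allocations, which also picks up
    # the portions that live on sharing providers.
    return {c: get_json('/allocations/%s' % c) for c in sorted(consumers)}


# Canned responses standing in for a placement service (made-up data).
FAKE_PLACEMENT = {
    '/resource_providers/cn1/allocations': {
        'allocations': {'inst1': {'resources': {'VCPU': 2}}},
        'resource_provider_generation': 5,
    },
    '/allocations/inst1': {
        'allocations': {
            'cn1': {'generation': 5, 'resources': {'VCPU': 2}},
            'shared_disk': {'generation': 9, 'resources': {'DISK_GB': 20}},
        },
        'project_id': 'proj',
        'user_id': 'user',
    },
}

allocs = get_allocations_for_provider_tree(
    ['cn1'], FAKE_PLACEMENT.__getitem__)
```

In this sketch the sharing provider `shared_disk` is never queried directly, yet its allocation for `inst1` still appears in the result because the second pass asks by consumer.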

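ReshapeNeeded exception¶

A new exception, ReshapeNeeded, will be introduced. update_provider_tree() raises it to signal that a reshape must be performed. This is for performance reasons: it lets us avoid calling get_allocations_for_provider_tree() unless it is necessary.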

Changes to update_provider_tree()¶

Allocations Parameter¶

A new allocations keyword argument will be added to update_provider_tree():

def update_provider_tree(self, provider_tree, nodename, allocations=None):

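If allocations is None, the update_provider_tree() method must not perform a reshape. If it decides a reshape is necessary, it must raise the new ReshapeNeeded exception.

When not None, the allocations argument is a dict, keyed by consumer UUID, of allocation records of the form: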

{ $CONSUMER_UUID: {
      # NOTE: The shape of each "allocations" dict below is identical to the
      # return from GET /allocations/{consumer_uuid}...
      "allocations": {
          $RP_UUID: {
              "generation": $RP_GEN,
              "resources": {
                  $RESOURCE_CLASS: $AMOUNT,
                  ...
              },
          },
          ...
      },
      "project_id": $PROJ_ID,
      "user_id": $USER_ID,
      # ...except for this, which is coming in bp/add-consumer-generation
      "consumer_generation": $CONSUMER_GEN,
  },
  ...
}

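If update_provider_tree() is moving allocations, it must edit the allocations dict in place.

Note

I don't love the idea of the method editing the dict in place rather than returning a copy, but it is consistent with how we handle the provider_tree argument.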
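To make "edit in place" concrete, here is a minimal sketch of moving one resource class between providers inside an allocations dict of the form shown above. The helper name `move_resource` and the sample UUIDs are hypothetical; a real driver would decide the target provider per device rather than with a single fixed pair.

```python
def move_resource(allocations, resource_class, from_rp, to_rp):
    """Move `resource_class` between providers, editing the dict in place."""
    for record in allocations.values():
        per_rp = record['allocations']
        source = per_rp.get(from_rp)
        if not source or resource_class not in source['resources']:
            continue  # This consumer holds nothing that needs to move.
        amount = source['resources'].pop(resource_class)
        target = per_rp.setdefault(to_rp, {'resources': {}})
        target['resources'][resource_class] = (
            target['resources'].get(resource_class, 0) + amount)
        # Drop the source provider entry if nothing is left on it.
        if not source['resources']:
            del per_rp[from_rp]


allocations = {
    'inst1': {
        'allocations': {
            'cn1': {'generation': 5, 'resources': {'VCPU': 2, 'VGPU': 1}},
        },
        'project_id': 'proj',
        'user_id': 'user',
    },
}
move_resource(allocations, 'VGPU', 'cn1', 'gpu1')
```

Because the caller keeps its reference to the same dict, the edited allocations can be handed on to update_from_provider_tree() without any return value, which mirrors how provider_tree is treated.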

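Virt Drivers¶

Virt drivers that currently override update_provider_tree() will need to change its signature to accommodate the new argument. That work will be done within the scope of this blueprint.

As virt drivers begin to model resources in nested providers, their implementations will need to:

- determine whether a reshape is necessary and raise ReshapeNeeded as appropriate;
- perform the reshape by processing the provider inventories and the specified allocations.

That work is outside the scope of this blueprint.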
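The two driver responsibilities listed above might look roughly like the sketch below. Everything here is an assumption made for illustration: `SketchDriver` is not a real driver, the exception class is a local stand-in, the provider tree is simplified to a plain dict of `{provider: {resource class: total}}`, and the child provider name is invented.

```python
class ReshapeNeeded(Exception):
    """Stand-in for the exception proposed in this spec."""


class SketchDriver:
    """Hypothetical driver moving VGPU inventory to a child provider."""

    def update_provider_tree(self, inventories, nodename, allocations=None):
        # `inventories` is a plain dict {provider: {resource class: total}},
        # a simplification standing in for the real ProviderTree object.
        root = inventories[nodename]
        if 'VGPU' not in root:
            return  # Already in the new shape: nothing to do.
        if allocations is None:
            # Without allocations we are not allowed to reshape.
            raise ReshapeNeeded()
        child = nodename + '_gpu0'  # Made-up child provider name.
        inventories.setdefault(child, {})['VGPU'] = root.pop('VGPU')
        for record in allocations.values():
            per_rp = record['allocations']
            resources = per_rp.get(nodename, {}).get('resources', {})
            if 'VGPU' in resources:
                amount = resources.pop('VGPU')
                per_rp.setdefault(child, {'resources': {}})
                per_rp[child]['resources']['VGPU'] = amount
                if not resources:
                    del per_rp[nodename]


inventories = {'cn1': {'VCPU': 8, 'VGPU': 4}}
allocations = {
    'inst1': {
        'allocations': {'cn1': {'resources': {'VCPU': 2, 'VGPU': 1}}},
        'project_id': 'proj',
        'user_id': 'user',
    },
}
driver = SketchDriver()
try:
    driver.update_provider_tree(inventories, 'cn1')
except ReshapeNeeded:
    # The caller reacts by supplying allocations and calling again.
    driver.update_provider_tree(inventories, 'cn1', allocations=allocations)
```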

Changes to update_from_provider_tree()¶

The SchedulerReportClient.update_from_provider_tree() method is changed to accept a new argument, allocations:

def update_from_provider_tree(self, context, new_tree, allocations):
    """Flush changes from a specified ProviderTree back to placement.

    ...

    ...
    :param allocations: A dict, keyed by consumer UUID, of allocation records
            of the form returned by GET /allocations/{consumer_uuid}. The
            dict must represent the comprehensive final picture of the
            allocations for each consumer therein. A value of None indicates
            that no reshape is being performed.
    ...
    """

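When allocations is None, update_from_provider_tree() behaves as it did previously (in Queens).

Changes to Resource Tracker _update()¶

The _update() method will gain a new parameter, startup, which is passed down from update_available_resource().

Where update_provider_tree() and update_from_provider_tree() are currently invoked, the code flow will change to approximately: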

allocs = None
try:
    self.driver.update_provider_tree(prov_tree, nodename)
except exception.ReshapeNeeded:
    if not startup:
        # Treat this like a regular exception during periodic
        raise
    LOG.info("Performing resource provider inventory and "
             "allocation data migration during compute service "
             "startup or FFU.")
    allocs = reportclient.get_allocations_for_provider_tree()
    self.driver.update_provider_tree(prov_tree, nodename,
                                     allocations=allocs)
...
reportclient.update_from_provider_tree(context, prov_tree, allocs)
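The flow above can be exercised end to end with stubs. The sketch below is only a model of the control flow: `StubDriver`, `StubReportClient`, and the free function `update` are hypothetical names, the exception is a local stand-in, and the stubs record calls instead of talking to placement.

```python
import logging

LOG = logging.getLogger(__name__)


class ReshapeNeeded(Exception):
    """Stand-in for the proposed nova exception."""


class StubDriver:
    """Hypothetical driver that needs at most one reshape."""

    def __init__(self, needs_reshape):
        self.needs_reshape = needs_reshape

    def update_provider_tree(self, prov_tree, nodename, allocations=None):
        if not self.needs_reshape:
            return
        if allocations is None:
            raise ReshapeNeeded()
        self.needs_reshape = False  # Pretend inventories/allocations moved.


class StubReportClient:
    """Hypothetical report client that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def get_allocations_for_provider_tree(self):
        self.calls.append('get_allocations')
        return {}

    def update_from_provider_tree(self, context, new_tree, allocations):
        self.calls.append(('flush', allocations))


def update(context, driver, reportclient, prov_tree, nodename, startup=False):
    allocs = None  # None tells the report client that no reshape happened.
    try:
        driver.update_provider_tree(prov_tree, nodename)
    except ReshapeNeeded:
        if not startup:
            raise  # Periodic run: treat as an ordinary failure.
        LOG.info("Reshaping provider tree for %s", nodename)
        allocs = reportclient.get_allocations_for_provider_tree()
        driver.update_provider_tree(prov_tree, nodename, allocations=allocs)
    reportclient.update_from_provider_tree(context, prov_tree, allocs)


client = StubReportClient()
update(None, StubDriver(needs_reshape=True), client, {}, 'cn1', startup=True)
```

With these stubs, allocations are gathered only in the startup-with-reshape case; a driver that needs no reshape goes straight to the flush with allocations left as None.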

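Changes to _update_available_resource_for_node()¶

This is currently where all exceptions from the Resource Tracker _update() periodic task are caught, logged, and otherwise ignored.

We will add a new parameter, startup, passed down from update_available_resource(), and a new except clause of the form: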

except exception.ResourceProviderUpdateFailed:
    if startup:
        # Kill the compute service.
        raise
    # Else log a useful exception reporting what happened and maybe even how
    # to fix it; and then carry on.

The purpose of this is to make exceptions in update_from_provider_tree() fatal on startup only.
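A minimal sketch of that "fatal on startup only" behavior follows. The exception class is a local stand-in for nova's, and `update_available_resource_for_node` and `update_fn` are hypothetical simplifications of the real method and of the call into _update().

```python
import logging

LOG = logging.getLogger(__name__)


class ResourceProviderUpdateFailed(Exception):
    """Stand-in for the nova exception named in the except clause above."""


def update_available_resource_for_node(update_fn, nodename, startup=False):
    """Run `update_fn`; provider update failures are fatal only on startup.

    `update_fn` is a hypothetical callable standing in for the resource
    tracker update of a single node.
    """
    try:
        update_fn(nodename, startup=startup)
    except ResourceProviderUpdateFailed:
        if startup:
            raise  # Propagate so that the compute service stops.
        LOG.exception("Updating resource providers for node %s failed; "
                      "carrying on.", nodename)
    except Exception:
        # Existing behavior for every other error: log it and carry on.
        LOG.exception("Error updating resources for node %s.", nodename)


def failing_update(nodename, startup=False):
    raise ResourceProviderUpdateFailed(nodename)


# Periodic run: the failure is logged and swallowed.
update_available_resource_for_node(failing_update, 'cn1', startup=False)
```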

Placement POST /reshaper¶

In a new placement microversion, a new POST /reshaper operation will be introduced. The payload is of the form:

{
  "inventories": {
    $RP_UUID: {
      # This is the exact payload format for
      # PUT /resource_providers/$RP_UUID/inventories.
      # It should represent the final state of the entire set of resources
      # for this provider. In particular, omitting a $RC dict will cause the
      # inventory for that resource class to be deleted if previously present.
      "inventories": { $RC: { <total, reserved, etc.> } },
      "resource_provider_generation": <gen of this RP>,
    },
    $RP_UUID: { ... },
  },
  "allocations": {
    # This is the exact payload format for POST /allocations
    $CONSUMER_UUID: {
      "project_id": $PROJ_ID,
      "user_id": $USER_ID,
      # This field is part of the consumer generation series under review,
      # not yet in the published POST /allocations payload.
      "consumer_generation": $CONSUMER_GEN,
      "allocations": {
        $RP_UUID: {
          "resources": { $RC: $AMOUNT, ... }
        },
        $RP_UUID: { ... }
      }
    },
    $CONSUMER_UUID: { ... }
  }
}

In a single atomic transaction, placement replaces the inventories for each $RP_UUID in the inventories dict, and replaces the allocations for each $CONSUMER_UUID in the allocations dict.
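As an illustration of how a caller could assemble such a payload from the data structures used elsewhere in this spec, consider the sketch below. The helper `build_reshaper_payload` and its three arguments are assumptions made for the example (the real report client works from a ProviderTree), and the sample data is invented.

```python
import json


def build_reshaper_payload(inventories, generations, allocations):
    """Assemble a POST /reshaper body.

    inventories: {rp_uuid: {resource_class: {inventory fields}}}
    generations: {rp_uuid: provider generation}
    allocations: dict keyed by consumer UUID, in the form of the
                 allocations parameter described earlier.
    """
    payload = {'inventories': {}, 'allocations': {}}
    for rp_uuid, inventory in inventories.items():
        payload['inventories'][rp_uuid] = {
            'inventories': inventory,
            'resource_provider_generation': generations[rp_uuid],
        }
    for consumer_uuid, record in allocations.items():
        payload['allocations'][consumer_uuid] = {
            'project_id': record['project_id'],
            'user_id': record['user_id'],
            'consumer_generation': record.get('consumer_generation'),
            # Only the resources are sent per provider; the per-provider
            # "generation" key from the GET response is dropped.
            'allocations': {
                rp_uuid: {'resources': dict(alloc['resources'])}
                for rp_uuid, alloc in record['allocations'].items()
            },
        }
    return payload


payload = build_reshaper_payload(
    inventories={'cn1': {'VCPU': {'total': 8}},
                 'gpu1': {'VGPU': {'total': 4}}},
    generations={'cn1': 5, 'gpu1': 0},
    allocations={
        'inst1': {
            'allocations': {
                'cn1': {'generation': 5, 'resources': {'VCPU': 2}},
                'gpu1': {'generation': 0, 'resources': {'VGPU': 1}},
            },
            'project_id': 'proj',
            'user_id': 'user',
            'consumer_generation': 1,
        },
    },
)
print(json.dumps(payload, indent=2, sort_keys=True))
```

Note that the payload states the complete final picture for every provider and consumer it mentions, so anything omitted for a listed provider or consumer is treated as removed.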

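Return values:

- 204 No Content on success.
- 409 Conflict on any provider or consumer generation conflict, or if a concurrent transaction is detected. Appropriate error codes should be used for at least the former, so that the caller can tell whether a fresh GET is needed before recalculating the necessary reshape and retrying the operation.
- 400 Bad Request on any other failure.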
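A caller could react to these status codes along the lines of the sketch below. The retry limit, the function name `post_reshape`, and the callables `post` and `build_payload` are all assumptions for illustration; the spec itself only says that a 409 may call for a fresh GET and a retry.

```python
def post_reshape(post, build_payload, max_attempts=3):
    """POST to /reshaper, rebuilding the payload after each conflict.

    post: hypothetical callable (url, payload) -> HTTP status code.
    build_payload: hypothetical callable that re-reads current state from
        placement and returns a freshly computed payload.
    """
    for _ in range(max_attempts):
        status = post('/reshaper', build_payload())
        if status == 204:
            return  # Success: inventories and allocations were replaced.
        if status != 409:
            raise RuntimeError('reshape failed with HTTP %d' % status)
        # 409: generations moved underneath us; loop to recompute and retry.
    raise RuntimeError(
        'reshape still conflicting after %d attempts' % max_attempts)


# Fake transport: the first attempt conflicts, the second succeeds.
responses = iter([409, 204])
attempts = []


def fake_post(url, payload):
    attempts.append(url)
    return next(responses)


post_reshape(fake_post, lambda: {'inventories': {}, 'allocations': {}})
```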

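Direct Interface to Placement¶

To make the Offline Upgrade Script possible, placement must be accessible by importing Python code rather than only as a standalone service. The quickest path forward is to use wsgi-intercept so that HTTP interactions made with the requests library work with only database traffic going over the network. This allows client code to modify the placement data store through the same API, without running a placement service.

An implementation of this, as a context manager called PlacementDirect, has been merged. The context manager accepts an oslo config, populated by the caller. This lets the calling code control how it discovers configuration settings, most importantly the database used by placement.

This implementation provides a quick solution for offline use of Placement POST /reshaper while leaving room for prettier solutions in the future.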
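wsgi-intercept works by routing HTTP client calls to a WSGI application inside the same process. The standard-library sketch below shows only that underlying idea, by calling a WSGI callable directly; it does not use wsgi-intercept, requests, or the merged PlacementDirect code, and `fake_placement_app` and `call_in_process` are hypothetical names.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def fake_placement_app(environ, start_response):
    """Tiny WSGI app standing in for the placement API (hypothetical)."""
    body = json.dumps({
        'method': environ['REQUEST_METHOD'],
        'path': environ['PATH_INFO'],
    }).encode('utf-8')
    start_response('200 OK', [
        ('Content-Type', 'application/json'),
        ('Content-Length', str(len(body))),
    ])
    return [body]


def call_in_process(app, method, path, body=b''):
    """Invoke a WSGI app without any socket or running server."""
    environ = {}
    setup_testing_defaults(environ)  # Fill in the required WSGI keys.
    environ.update({
        'REQUEST_METHOD': method,
        'PATH_INFO': path,
        'CONTENT_LENGTH': str(len(body)),
        'wsgi.input': io.BytesIO(body),
    })
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers

    chunks = app(environ, start_response)
    return captured['status'], json.loads(b''.join(chunks))


status, data = call_in_process(
    fake_placement_app, 'GET', '/resource_providers')
```

The request passes through the same application entry point a server would use, which is why this style of access keeps the API layer's behavior while needing no placement service.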

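Offline Upgrade Script¶

To facilitate Fast-Forward Upgrades, we will provide a script that performs this reshaping while all services (except databases) are offline. It will look like: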

nova-manage placement migrate_compute_inventory

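...and operate as follows, for each nodename on the host (there is exactly one, except for ironic):

- Spin up a SchedulerReportClient with a Direct Interface to Placement.
- Retrieve a ProviderTree via SchedulerReportClient.get_provider_tree_and_ensure_root().
- Instantiate the appropriate virt driver.
- Perform the algorithm described in Changes to Resource Tracker _update(), as if startup were True.

We may refer to https://review.openstack.org/#/c/501025/ for an example of an upgrade script that requires a virt driver.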

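Alternatives¶

Reshaper API¶

Alternatives to Placement POST /reshaper were discussed in the mailing list thread, the etherpad, IRC, hangouts, and so on. They included:

- Not providing an atomic placement operation, and instead performing the necessary operations one at a time from the resource tracker. Rejected due to race conditions: the scheduler could schedule against the inventories being moved, based on incorrect capacity information.
- "Locking" the inventories being moved, either by providing a locking API or by setting reserved = total, while the resource tracker performs the reshape. Rejected because it is a hack, and because recovering from partial failures would be difficult.
- "Merge" forms of the new placement operation:
  - PATCH (or POST) with RFC 6902-style "operation", "path"[, "from", "value"] instructions.
  - PATCH (or POST) with RFC 7396 semantics. The JSON payload would look like a sparse version of the one described in Placement POST /reshaper, with only the changes included.
- Other payload formats for the placement operation (see the etherpad). We chose the one we did because it reuses existing payload syntax (and may therefore be able to reuse code), and because it fully specifies the expected end state, which is RESTful.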

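Direct Placement¶

Alternatives to the wsgi-intercept model for the Direct Interface to Placement:

- Access the object methods directly (with some refactoring and cleanup). Rejected because we would lose things like schema validation and microversion logic.
- Create cleaner, more Pythonic wrappers around those object methods. Rejected in the short term for the sake of expediency. We may adopt this approach in the long term if the need for direct placement extends beyond the FFU script.
- Use wsgi-intercept, but create the Pythonic wrappers outside of the REST layer. This is also a long-term option.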

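Reshaping Via update_provider_tree()¶

- We considered passing allocations to update_provider_tree() every time, but gathering the allocations is expensive, so we needed a way to do it only when necessary. Hence the ReshapeNeeded exception.
- We considered running the check-and-reshape-if-needed algorithm on every periodic interval, but decided that we should never need to reshape except on startup.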

Data model impact¶

None.

REST API impact¶

See Placement POST /reshaper.

Security impact¶

None.

Notifications impact¶

None.

Other end user impact¶

See Upgrade impact.

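Performance Impact¶

The new Placement POST /reshaper operation has the potential to be slow and to lock several tables. Its use should be restricted to reshaping provider trees. Initially we may use the reshaper from update_from_provider_tree() even when no reshape is being performed; if that proves problematic for performance, we can restrict it to reshape scenarios only, which will be very rare.

Gathering allocations, particularly in large deployments, has the potential to be heavy and slow, so we do it only at compute startup, and then only if update_provider_tree() indicates that a reshape is necessary.

Other deployer impact¶

See Upgrade impact.

Developer impact¶

See Virt Drivers.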

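Upgrade impact¶

Live upgrades are covered. The Resource Tracker _update() flow will run on compute startup and perform the reshape as necessary. Since we do not support skipping releases on live upgrades, any virt driver-specific changes can be removed from one release to the next.

The Offline Upgrade Script is provided for Fast-Forward Upgrades. Since the code is run with each release's codebase at each step of the FFU, any virt driver-specific changes can be removed from one release to the next. Note, however, that the script must always be run, because only the virt driver, running on a specific compute, can determine whether a reshape is required for that compute. (If no reshape is necessary, the script is a no-op.)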

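Implementation¶

Assignee(s)¶

- Placement POST /reshaper: jaypipes (SQL-fu), cdent (API plumbing)
- Direct Interface to Placement: cdent
- Report client, resource tracker, virt driver parity: efried
- Offline Upgrade Script: dansmith
- Reviews and general heckling: mriedem, bauzas, gibi, edleafe, alex_xu

Work Items¶

See Proposed change.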

Dependencies¶

Testing¶

Functional test enhancements for everything, including gabbi tests for Placement POST /reshaper.

Live testing in Xen (naichuans) and libvirt (bauzas) via their VGPU work.

Documentation Impact¶

- Placement POST /reshaper (placement API reference)
- Offline Upgrade Script (nova-manage db)

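References¶

- Consumer Generations spec
- Nested Resource Providers - Allocation Candidates
- Placement reshape API discussion etherpad
- Upgrade concerns… mailing list thread
- RFC 6902 (PATCH with json-patch+json)
- RFC 7396 (PATCH with merge-patch+json)
- Documentation for the nova-manage db migration helper
- wsgi-intercept
- Python requests
- PlacementDirect implementation
- oslo config library

History¶

Revisions¶
Release Name	Description
Rocky	Introduced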
