Administer a Cluster

Install Drivers and Allocate Devices with DRA

FEATURE STATE: Kubernetes v1.35 [stable] (enabled by default)

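This tutorial shows you how to install Dynamic Resource Allocation (DRA) drivers in your cluster and how to use them with the DRA APIs to allocate devices to Pods. This page is intended for cluster administrators.

Dynamic Resource Allocation (DRA) lets a cluster manage the availability and allocation of hardware resources to satisfy Pod-based claims for hardware requirements and preferences. To support this, Kubernetes built-in components (such as the Kubernetes scheduler, kubelet, and kube-controller-manager) and third-party drivers from device owners (called DRA drivers) share the responsibility to advertise, allocate, prepare, mount, health-check, unprepare, and clean up resources throughout the Pod lifecycle. These components share information through a set of DRA-specific APIs in the resource.k8s.io API group, including DeviceClasses, ResourceSlices, and ResourceClaims, as well as new fields in the Pod spec itself.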

Objectives

Deploy an example DRA driver
Deploy a Pod that requests hardware resources using the DRA APIs
Delete a Pod that has a claim

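Before you begin

Your cluster should support RBAC. You can try this tutorial on a cluster that uses a different authorization mechanism, but in that case you will have to adapt the steps that define roles and permissions.

You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube, or you can use one of the Kubernetes playgrounds.

This tutorial has been tested with Linux nodes, though it may also work with other types of nodes.

Your Kubernetes server must be at or later than version v1.34.

To check the version, enter kubectl version.

If your cluster is not currently running Kubernetes 1.35, review the documentation for the version of Kubernetes that you plan to use.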

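Explore the initial cluster state

You can take some time to observe the initial state of a cluster with DRA enabled, especially if you have not used these APIs before. If you set up a new cluster for this tutorial, with no driver installed and no Pod claims to satisfy yet, the output of these commands won't show any resources.

Get a list of DeviceClasses: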
```
kubectl get deviceclasses
```
The output is similar to this:
```
No resources found
```

Get a list of ResourceSlices:
```
kubectl get resourceslices
```
The output is similar to this:
```
No resources found
```

Get a list of ResourceClaims and ResourceClaimTemplates:

kubectl get resourceclaims -A
kubectl get resourceclaimtemplates -A

The output is similar to this:

No resources found
No resources found

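At this point, you have confirmed that DRA is enabled and configured properly in the cluster, and that no DRA drivers have advertised any resources to the DRA APIs yet.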
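If one of these commands returns an error rather than an empty list, a reasonable first check is whether the API server serves the resource.k8s.io API group at all. The following is a minimal sketch; the helper name `list_dra_api_resources` is made up for illustration and is not part of the tutorial.

```shell
# List the resource types served under the resource.k8s.io API group.
# On a cluster with DRA enabled, the list is expected to include
# deviceclasses, resourceslices, resourceclaims and resourceclaimtemplates.
list_dra_api_resources() {
  kubectl api-resources --api-group=resource.k8s.io -o name
}
```

Call `list_dra_api_resources` from a shell where kubectl is configured for the cluster. An empty result suggests that the API group is not enabled.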

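Install an example DRA driver

DRA drivers are third-party applications that run on each node of your cluster to interface with the hardware of that node and with the Kubernetes built-in DRA components. The installation procedure depends on the driver you choose, but a driver is typically deployed as a DaemonSet to all or a selection of the nodes in your cluster (using selectors or similar mechanisms).

Check your driver's documentation for specific installation instructions, which might include a Helm chart, a set of manifests, or other deployment tooling.

This tutorial uses an example driver from the kubernetes-sigs/dra-example-driver repository to demonstrate driver installation. This example driver advertises simulated GPUs to Kubernetes for your Pods to use.

Prepare your cluster for driver installation

To simplify cleanup, create a namespace named dra-tutorial: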
```
kubectl create namespace dra-tutorial 
```

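In a production environment, you would likely use a previously released or qualified image from the driver vendor or your own organization, and your nodes would need access to the image registry that hosts the driver image. In this tutorial, you use a publicly released image of the dra-example-driver to simulate access to a DRA driver image.

Confirm that your nodes can access the image by running the following command on one of your cluster's nodes: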
```
docker pull registry.k8s.io/dra-example-driver/dra-example-driver:v0.2.0
```
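Nodes that run containerd or CRI-O may not have a Docker client installed. In that case, crictl, if it is present on the node, can pull the same image through the container runtime. The sketch below assumes that at least one of the two tools is available; `pull_driver_image` is a hypothetical helper name.

```shell
# Pull the example driver image with whichever client is present on the node.
# Prefers crictl (which talks to the CRI runtime) and falls back to docker.
pull_driver_image() {
  image="registry.k8s.io/dra-example-driver/dra-example-driver:v0.2.0"
  if command -v crictl >/dev/null 2>&1; then
    crictl pull "$image"
  elif command -v docker >/dev/null 2>&1; then
    docker pull "$image"
  else
    echo "neither crictl nor docker found on this node" >&2
    return 1
  fi
}
```

Run `pull_driver_image` on the node itself. A non-zero exit status means the image could not be pulled with either tool.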

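Deploy the DRA driver components

For this tutorial, you install the critical example resource driver components individually with kubectl.

Create the DeviceClass that represents the device types this DRA driver supports: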

dra/driver-install/deviceclass.yaml

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel: 
      expression: "device.driver == 'gpu.example.com'"

kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/deviceclass.yaml

Create the ServiceAccount, ClusterRole, and ClusterRoleBinding that the driver uses to gain permission to interact with the Kubernetes API on this cluster:

Create the ServiceAccount:

dra/driver-install/serviceaccount.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: dra-example-driver-service-account
  namespace: dra-tutorial
  labels:
    app.kubernetes.io/name: dra-example-driver
    app.kubernetes.io/instance: dra-example-driver

kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/serviceaccount.yaml

Create the ClusterRole:

dra/driver-install/clusterrole.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dra-example-driver-role
rules:
- apiGroups: ["resource.k8s.io"]
  resources: ["resourceclaims"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
- apiGroups: ["resource.k8s.io"]
  resources: ["resourceslices"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/clusterrole.yaml

Create the ClusterRoleBinding:

dra/driver-install/clusterrolebinding.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dra-example-driver-role-binding
subjects:
- kind: ServiceAccount
  name: dra-example-driver-service-account
  namespace: dra-tutorial
roleRef:
  kind: ClusterRole
  name: dra-example-driver-role
  apiGroup: rbac.authorization.k8s.io

kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/clusterrolebinding.yaml

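Create a PriorityClass for the DRA driver. The PriorityClass prevents preemption of the DRA driver component, which is responsible for important lifecycle operations for Pods with claims. For details, see the Kubernetes documentation on Pod priority and preemption.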

dra/driver-install/priorityclass.yaml

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dra-driver-high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for DRA driver pods only."

kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/priorityclass.yaml

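Deploy the actual DRA driver as a DaemonSet that is configured to run the example driver binary with the permissions provisioned above. The DaemonSet has the permissions that you granted to the ServiceAccount in the previous steps.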

dra/driver-install/daemonset.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dra-example-driver-kubeletplugin
  namespace: dra-tutorial
  labels:
    app.kubernetes.io/name: dra-example-driver
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dra-example-driver
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dra-example-driver
    spec:
      priorityClassName: dra-driver-high-priority
      serviceAccountName: dra-example-driver-service-account
      securityContext:
        {}
      containers:
      - name: plugin
        securityContext:
          privileged: true
        image: registry.k8s.io/dra-example-driver/dra-example-driver:v0.2.0
        imagePullPolicy: IfNotPresent
        command: ["dra-example-kubeletplugin"]
        resources:
          {}
        # Production drivers should always implement a liveness probe
        # For the tutorial we simply omit it
        # livenessProbe:
        #   grpc:
        #     port: 51515
        #     service: liveness
        #   failureThreshold: 3
        #   periodSeconds: 10
        env:
        - name: CDI_ROOT
          value: /var/run/cdi
        - name: KUBELET_REGISTRAR_DIRECTORY_PATH
          value: "/var/lib/kubelet/plugins_registry"
        - name: KUBELET_PLUGINS_DIRECTORY_PATH
          value: "/var/lib/kubelet/plugins"
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        # Simulated number of devices the example driver will pretend to have.
        - name: NUM_DEVICES
          value: "9"
        - name: HEALTHCHECK_PORT
          value: "51515"
        volumeMounts:
        - name: plugins-registry
          mountPath: "/var/lib/kubelet/plugins_registry"
        - name: plugins
          mountPath: "/var/lib/kubelet/plugins"
        - name: cdi
          mountPath: /var/run/cdi
      volumes:
      - name: plugins-registry
        hostPath:
          path: "/var/lib/kubelet/plugins_registry"
      - name: plugins
        hostPath:
          path: "/var/lib/kubelet/plugins"
      - name: cdi
        hostPath:
          path: /var/run/cdi

kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/daemonset.yaml

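The DaemonSet is configured with the volume mounts needed to interact with the underlying Container Device Interface (CDI) directory and to expose its socket to the kubelet through the kubelet/plugins directory.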
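The driver Pods can take a moment to start on every node. One way to block until the DaemonSet has finished rolling out is sketched below; the helper name `wait_for_driver_rollout` and the default timeout of 120 seconds are arbitrary choices for illustration.

```shell
# Block until every Pod of the driver DaemonSet is rolled out, or time out.
# An optional first argument overrides the timeout (for example: 30s).
wait_for_driver_rollout() {
  kubectl rollout status daemonset/dra-example-driver-kubeletplugin \
    --namespace dra-tutorial --timeout="${1:-120s}"
}
```

Calling `wait_for_driver_rollout` right after applying the DaemonSet manifest returns once the rollout completes, or fails when the timeout expires.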

Verify the DRA driver installation

Get a list of the Pods of the DRA driver DaemonSet across all worker nodes:

kubectl get pod -l app.kubernetes.io/name=dra-example-driver -n dra-tutorial

The output is similar to this:

NAME                                     READY   STATUS    RESTARTS   AGE
dra-example-driver-kubeletplugin-4sk2x   1/1     Running   0          13s
dra-example-driver-kubeletplugin-cttr2   1/1     Running   0          13s

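The initial responsibility of each node's local DRA driver is to tell the cluster which devices are available to Pods on that node, by publishing its metadata to the ResourceSlices API. You can check that API to see that each node with a driver is advertising the device class it represents.

Check for available ResourceSlices: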
```
kubectl get resourceslices
```
The output is similar to this:
```
NAME                                 NODE           DRIVER            POOL           AGE
kind-worker-gpu.example.com-k69gd    kind-worker    gpu.example.com   kind-worker    19s
kind-worker2-gpu.example.com-qdgpn   kind-worker2   gpu.example.com   kind-worker2   19s
```
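The table above shows one ResourceSlice per node but not the devices inside each slice. A possible way to list them with a JSONPath template is sketched here. It assumes the resource.k8s.io/v1 field layout (`spec.nodeName` and `spec.devices[].name`), and the helper name `list_advertised_devices` is hypothetical.

```shell
# Print one line per ResourceSlice: the node name, then the names of the
# devices that the slice advertises.
list_advertised_devices() {
  kubectl get resourceslices \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{": "}{range .spec.devices[*]}{.name}{" "}{end}{"\n"}{end}'
}
```

For the full attribute and capacity details of each device, `kubectl get resourceslices -o yaml` shows the complete objects.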

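At this point, you have successfully installed the example DRA driver and confirmed its initial configuration. You're now ready to use DRA to schedule Pods.

Claim resources and deploy a Pod

To request resources using DRA, you create ResourceClaims or ResourceClaimTemplates that define the resources your Pods need. In the example driver, the mock GPU devices expose a memory capacity attribute. This section shows you how to use Common Expression Language to express your requirements in a ResourceClaim, select that ResourceClaim in a Pod specification, and observe the resource allocation.

This tutorial showcases only one basic example of a DRA ResourceClaim. Read Dynamic Resource Allocation to learn more about ResourceClaims.

Create the ResourceClaim

In this section, you create a ResourceClaim and reference it in a Pod. Whatever the claim, deviceClassName is a required field that narrows the scope of the request to a specific device class. The request itself can include a Common Expression Language expression that references attributes advertised by the driver that manages that device class.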
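To see which fields a device request accepts, and which of them the API marks as required, the built-in schema documentation can be queried from the cluster. A small sketch, with `explain_device_request` as an illustrative helper name:

```shell
# Show the schema documentation for the "exactly" part of a device request,
# including which fields the API marks as required.
explain_device_request() {
  kubectl explain resourceclaims.spec.devices.requests.exactly \
    --api-version=resource.k8s.io/v1
}
```

Appending a field name to the path (for example `.selectors`) narrows the output to that field.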

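In this example, you create a ResourceClaim that requests any GPU advertising at least 10Gi of memory capacity (the CEL expression uses `>= 0` on the comparison result, so a device with exactly 10Gi also matches). The attribute that exposes capacity in the example driver takes the form device.capacity['gpu.example.com'].memory. Note also that both the ResourceClaim and the request inside it are named some-gpu.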

dra/driver-install/example/resourceclaim.yaml

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
 name: some-gpu
 namespace: dra-tutorial
spec:
   devices:
     requests:
     - name: some-gpu
       exactly:
         deviceClassName: gpu.example.com
         selectors:
         - cel:
             expression: "device.capacity['gpu.example.com'].memory.compareTo(quantity('10Gi')) >= 0"

kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/example/resourceclaim.yaml

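Create the Pod that references that ResourceClaim

Below is the Pod manifest. Its spec.resourceClaims.resourceClaimName field references the some-gpu ResourceClaim that you just created. The local name for that claim, gpu, is then used in the spec.containers.resources.claims.name field to allocate the claim to the Pod's container.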

dra/driver-install/example/pod.yaml

apiVersion: v1
kind: Pod
metadata:
  name: pod0
  namespace: dra-tutorial
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:24.04
    command: ["bash", "-c"]
    args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: some-gpu

kubectl apply --server-side -f http://k8s.io/examples/dra/driver-install/example/pod.yaml

Confirm that the Pod has deployed:

kubectl get pod pod0 -n dra-tutorial

The output is similar to this:

NAME   READY   STATUS    RESTARTS   AGE
pod0   1/1     Running   0          9s
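If the Pod is not Running yet, polling by hand can be replaced with a blocking wait. The following sketch uses `kubectl wait`; the helper name `wait_for_pod0` and the 60 second default timeout are assumptions made for this example.

```shell
# Wait until pod0 reports the Ready condition, or give up after the timeout.
# An optional first argument overrides the timeout (for example: 30s).
wait_for_pod0() {
  kubectl wait --for=condition=Ready pod/pod0 \
    --namespace dra-tutorial --timeout="${1:-60s}"
}
```

If the wait times out, `kubectl describe pod pod0 -n dra-tutorial` lists the scheduling events, which may indicate why the claim could not be satisfied.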

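Explore the DRA state

After you create the Pod, the cluster tries to schedule that Pod to a node where Kubernetes can satisfy the ResourceClaim. In this tutorial, the DRA driver is deployed on all nodes and advertises mock GPUs on all of them, each with enough capacity to satisfy the Pod's claim. Kubernetes can therefore schedule this Pod on any node and allocate any of the mock GPUs on that node.

When Kubernetes allocates a mock GPU to a Pod, the example driver adds environment variables to each container that the device is allocated to. These variables indicate which GPUs a real resource driver would have injected and how they would have been configured. You can check those environment variables to see how the system handled the Pod.

Check the Pod logs, which report the name of the mock GPU that was allocated: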

kubectl logs pod0 -c ctr0 -n dra-tutorial | grep -E "GPU_DEVICE_[0-9]+=" | grep -v "RESOURCE_CLAIM"

The output is similar to this:

declare -x GPU_DEVICE_0="gpu-0"

Check the state of the ResourceClaim object:
```
kubectl get resourceclaims -n dra-tutorial
```
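The output is similar to this:
```
NAME       STATE                AGE
some-gpu   allocated,reserved   34s
```
In this output, the STATE column shows that the ResourceClaim is allocated and reserved.

Check the details of the some-gpu ResourceClaim. The status stanza of the ResourceClaim has information about the allocated device and the Pod it has been reserved for: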

kubectl get resourceclaim some-gpu -n dra-tutorial -o yaml

The output is similar to this:

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  creationTimestamp: "2025-08-20T18:17:31Z"
  finalizers:
  - resource.kubernetes.io/delete-protection
  name: some-gpu
  namespace: dra-tutorial
  resourceVersion: "2326"
  uid: d3e48dbf-40da-47c3-a7b9-f7d54d1051c3
spec:
  devices:
    requests:
    - exactly:
        allocationMode: ExactCount
        count: 1
        deviceClassName: gpu.example.com
        selectors:
        - cel:
            expression: device.capacity['gpu.example.com'].memory.compareTo(quantity('10Gi')) >= 0
      name: some-gpu
status:
  allocation:
    devices:
      results:
      - device: gpu-0
        driver: gpu.example.com
        pool: kind-worker
        request: some-gpu
    nodeSelector:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - kind-worker
  reservedFor:
  - name: pod0
    resource: pods
    uid: c4dadf20-392a-474d-a47b-ab82080c8bd7
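When only the allocated device names are of interest, a JSONPath template can extract them from the same object instead of reading the full YAML. This is a sketch based on the status.allocation.devices.results field in the ResourceClaim output; `allocated_devices` is a hypothetical helper name.

```shell
# Print the names of the devices allocated to the some-gpu claim
# (space-separated; empty if nothing is allocated).
allocated_devices() {
  kubectl get resourceclaim some-gpu --namespace dra-tutorial \
    -o jsonpath='{.status.allocation.devices.results[*].device}'
}
```

For a claim in the state described in this section, the function would be expected to print the single device name recorded under results.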

To check how the driver handled device allocation, get the logs for the driver DaemonSet Pods:

kubectl logs -l app.kubernetes.io/name=dra-example-driver -n dra-tutorial

The output is similar to this:

I0820 18:17:44.131324       1 driver.go:106] PrepareResourceClaims is called: number of claims: 1
I0820 18:17:44.135056       1 driver.go:133] Returning newly prepared devices for claim 'd3e48dbf-40da-47c3-a7b9-f7d54d1051c3': [{[some-gpu] kind-worker gpu-0 [k8s.gpu.example.com/gpu=common k8s.gpu.example.com/gpu=d3e48dbf-40da-47c3-a7b9-f7d54d1051c3-gpu-0]}]

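You have now successfully deployed a Pod that claims devices using DRA, verified that the Pod was scheduled to an appropriate node, and confirmed that the associated DRA API objects reflect the allocation status.

Delete a Pod that has a claim

When a Pod with a claim is deleted, the DRA driver deallocates the resource so that it is available for future scheduling. To validate this behavior, delete the Pod that you created in the previous steps and watch the corresponding changes to the ResourceClaim and the driver.

Delete the pod0 Pod: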

kubectl delete pod pod0 -n dra-tutorial

The output is similar to this:

pod "pod0" deleted

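Observe the DRA state

When the Pod is deleted, the driver deallocates the device from the ResourceClaim and updates the ResourceClaim resource in the Kubernetes API. The ResourceClaim stays in the pending state until a new Pod references it.

Check the state of the some-gpu ResourceClaim: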

kubectl get resourceclaims -n dra-tutorial

The output is similar to this:

NAME       STATE     AGE
some-gpu   pending   76s
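The same check can be scripted by testing whether the claim still lists any consumer under status.reservedFor. The sketch below assumes kubectl's default behavior of printing nothing when a JSONPath key is missing; `claim_is_released` is an illustrative helper name.

```shell
# Succeed (exit 0) when the some-gpu claim has no consumers left in
# status.reservedFor; fail (exit 1) while a Pod still holds it or when
# the claim cannot be read.
claim_is_released() {
  reserved="$(kubectl get resourceclaim some-gpu --namespace dra-tutorial \
    -o jsonpath='{.status.reservedFor[*].name}')" || return 1
  [ -z "$reserved" ]
}
```

A possible use is `claim_is_released && echo "some-gpu is free"`, run after the Pod deletion has completed.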

Check the driver logs to verify that the driver has unprepared the device for this claim:

kubectl logs -l app.kubernetes.io/name=dra-example-driver -n dra-tutorial

The output is similar to this:

I0820 18:22:15.629376       1 driver.go:138] UnprepareResourceClaims is called: number of claims: 1

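You have now deleted a Pod that had a claim, and observed that the driver unprepared the underlying hardware resource and updated the DRA APIs to reflect that the resource is available again for future scheduling.

Cleaning up

To clean up the resources that you created in this tutorial, run the following commands: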

kubectl delete namespace dra-tutorial
kubectl delete deviceclass gpu.example.com
kubectl delete clusterrole dra-example-driver-role
kubectl delete clusterrolebinding dra-example-driver-role-binding
kubectl delete priorityclass dra-driver-high-priority
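Namespace deletion finishes asynchronously, so the namespaced resources can linger briefly after the command returns. A sketch of one way to block until the namespace is gone, with `wait_for_namespace_removal` as a hypothetical helper name and an arbitrary default timeout:

```shell
# Wait until the dra-tutorial namespace has been fully removed.
# An optional first argument overrides the timeout (for example: 30s).
wait_for_namespace_removal() {
  kubectl wait --for=delete namespace/dra-tutorial --timeout="${1:-120s}"
}
```

This is intended to be run right after the delete commands, while the namespace is still terminating.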
