{"componentChunkName":"component---src-templates-acg-portal-new-template-tsx","path":"/smohxoure","result":{"data":{"markdownRemark":{"html":"<p>Data processing operators can be developed with the aihc-daft package provided by Baige. aihc-daft is a multimodal AI data processing framework from the Baidu AI Heterogeneous Computing Platform (AIHC). It is built on <a href=\"https://github.com/Eventual-Inc/Daft\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Daft</a>, ships a ready-to-use library of data processing operators, scales elastically from a single multi-core machine to a multi-node distributed cluster, and targets AI training data production.</p>\n<h2 id=\"daft-核心特性\"><a href=\"#daft-%E6%A0%B8%E5%BF%83%E7%89%B9%E6%80%A7\" aria-label=\"daft 核心特性 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Daft Core Features</h2>\n<ul>\n<li><strong>Elastic distributed execution</strong>. Two execution modes are supported: single-machine multi-core (Native Runner) and multi-node cluster (Ray Runner). Business code needs no changes, and a one-line configuration switch scales a job from a single machine to a cluster of a hundred nodes, covering everything from development and debugging to PB-scale data production.</li>\n<li><strong>Lazy evaluation and query optimization</strong>. Daft uses a lazy execution model: data transformations only build a logical plan, which is optimized and executed as a whole when <code>collect()</code> / <code>show()</code> / <code>write_*()</code> is called. Optimizations such as predicate pushdown and column pruning are applied automatically, reducing unnecessary I/O and compute.</li>\n<li>\n<p><strong>Rich data format support</strong>. Mainstream data formats can be read and written natively, including:</p>\n<ul>\n<li>Structured data: Parquet, CSV, JSON, SQL databases</li>\n<li>Data lake formats: Delta Lake, Apache Iceberg, Apache Hudi, Lance</li>\n<li>AI datasets: HuggingFace Hub datasets</li>\n<li>Multimedia: video frame sequences, WARC web archives, MCAP robot sensor data</li>\n</ul>\n</li>\n<li><strong>Native multimodal data types</strong>. Built-in multimedia types such as <code>Image</code>, <code>Video</code> and <code>Audio</code> let DataFrame columns store and process images, audio and video directly, with no manual serialization.</li>\n<li><strong>Flexible UDF (user-defined function) system</strong>. A complete UDF development framework is provided. UDFs can declare CPU, GPU and memory requirements, and the framework handles task scheduling and resource allocation automatically. Batch mode, concurrency control and process/thread isolation are supported, covering both CPU-intensive workloads and GPU inference.</li>\n<li><strong>Native GPU scheduling</strong>. A UDF can declare the GPU resources it needs (fractions are allowed, e.g. <code>num_gpus=0.5</code>). The framework works with Ray 
to perform GPU-aware scheduling, which suits GPU-intensive operators such as deep learning inference and vectorization.</li>\n<li><strong>SQL query support</strong>. DataFrames can be queried directly with SQL syntax (<code>daft.sql()</code>), which lowers the barrier for users who are already familiar with SQL.</li>\n<li><strong>Unified access to multiple storage backends</strong>. A unified storage abstraction layer supports the local file system, Baidu Object Storage (BOS), AWS S3, HTTP(S) and other backends. The same API accesses every backend, and requests are routed automatically by path prefix.</li>\n<li><strong>Simple, intuitive DataFrame API</strong>. A Pandas-like DataFrame interface provides common operations such as <code>select</code>, <code>filter</code>, <code>groupby</code>, <code>join</code>, <code>sort</code> and <code>limit</code>, plus window functions (<code>Window</code>), so it is quick to learn.</li>\n<li><strong>Data lake catalog integration</strong>. Integration with mainstream data lake catalogs such as Apache Iceberg, Apache Gravitino and Unity Catalog enables data governance, table version management and cross-platform data sharing.</li>\n</ul>\n<h2 id=\"集成aihc-daft方式\"><a href=\"#%E9%9B%86%E6%88%90aihc-daft%E6%96%B9%E5%BC%8F\" aria-label=\"集成aihc daft方式 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Integrating aihc-daft</h2>\n<p>You can integrate aihc-daft through either a container image or a pip package.</p>\n<ol>\n<li>Image</li>\n</ol>\n\n    <div class=\"code-block-wrapper\">\n        <div class=\"code-block\">\n            <div class=\"code-block-header\">\n                <span class=\"code-block-name\">Plain Text</span>\n                <button class=\"code-copy-btn\" data-tooltip-text=\"\">\n                    <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" fill=\"none\"> <path fill-rule=\"evenodd\" clip-rule=\"evenodd\" d=\"M5.57894 3.45614C5.57894 3.38832 5.63392 3.33333 5.70175 3.33333H12.5439C12.6117 3.33333 12.6667 3.38832 12.6667 3.45614V10.2982C12.6667 10.3661 12.6117 10.4211 12.5439 
10.4211H11.7544V5.70175C11.7544 4.89754 11.1025 4.24561 10.2982 4.24561H5.57894V3.45614ZM4.24561 4.24561V3.45614C4.24561 2.65194 4.89754 2 5.70175 2H12.5439C13.3481 2 14 2.65194 14 3.45614V10.2982C14 11.1025 13.3481 11.7544 12.5439 11.7544H11.7544V12.5439C11.7544 13.3481 11.1025 14 10.2982 14H3.45614C2.65194 14 2 13.3481 2 12.5439V5.70175C2 4.89754 2.65194 4.24561 3.45614 4.24561H4.24561ZM3.33333 5.70175C3.33333 5.63392 3.38832 5.57894 3.45614 5.57894H10.2982C10.3661 5.57894 10.4211 5.63392 10.4211 5.70175V12.5439C10.4211 12.6117 10.3661 12.6667 10.2982 12.6667H3.45614C3.38832 12.6667 3.33333 12.6117 3.33333 12.5439V5.70175Z\" fill=\"currentColor\"></path> </svg>\n                    Copy\n                </button>\n            </div>\n            <div class=\"code-block-content\">\n                <pre class=\"language-text\"><code><span class=\"line-number\">1</span> ccr-registry.baidubce.com/aihc/aihc-daft-gpu:0.3.2-cu12.1-py3.11-ubuntu22.04</code></pre>\n            </div>\n        </div>\n    </div>\n  \n<p>The image comes with aihc-daft preinstalled, along with the dependencies and runtime environment for the multimodal operators, such as CUDA, Conda and Ray. Using it directly is recommended.</p>\n<ol start=\"2\">\n<li>\n<p>Offline pip package installation</p>\n<ul>\n<li>aihc-daft has not been published to PyPI yet, so you need to download the offline installation package. <a href=\"https://aihc-public.bj.bcebos.com/aihc_daft/aihc_daft-0.3.2-cp310-abi3-manylinux_2_12_x86_64.whl\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Download</a></li>\n<li>Run the following command to install it:</li>\n</ul>\n</li>\n</ol>\n\n    <div class=\"code-block-wrapper\">\n        <div class=\"code-block\">\n            <div class=\"code-block-header\">\n                <span class=\"code-block-name\">Plain Text</span>\n                <button class=\"code-copy-btn\" data-tooltip-text=\"\">\n                    <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" fill=\"none\"> <path fill-rule=\"evenodd\" clip-rule=\"evenodd\" d=\"M5.57894 3.45614C5.57894 3.38832 5.63392 3.33333 5.70175 3.33333H12.5439C12.6117 3.33333 12.6667 3.38832 12.6667 3.45614V10.2982C12.6667 10.3661 
12.6117 10.4211 12.5439 10.4211H11.7544V5.70175C11.7544 4.89754 11.1025 4.24561 10.2982 4.24561H5.57894V3.45614ZM4.24561 4.24561V3.45614C4.24561 2.65194 4.89754 2 5.70175 2H12.5439C13.3481 2 14 2.65194 14 3.45614V10.2982C14 11.1025 13.3481 11.7544 12.5439 11.7544H11.7544V12.5439C11.7544 13.3481 11.1025 14 10.2982 14H3.45614C2.65194 14 2 13.3481 2 12.5439V5.70175C2 4.89754 2.65194 4.24561 3.45614 4.24561H4.24561ZM3.33333 5.70175C3.33333 5.63392 3.38832 5.57894 3.45614 5.57894H10.2982C10.3661 5.57894 10.4211 5.63392 10.4211 5.70175V12.5439C10.4211 12.6117 10.3661 12.6667 10.2982 12.6667H3.45614C3.38832 12.6667 3.33333 12.6117 3.33333 12.5439V5.70175Z\" fill=\"currentColor\"></path> </svg>\n                    Copy\n                </button>\n            </div>\n            <div class=\"code-block-content\">\n                <pre class=\"language-text\"><code><span class=\"line-number\">1</span>pip install aihc_daft-0.3.2-cp310-abi3-manylinux_2_12_x86_64.whl</code></pre>\n            </div>\n        </div>\n    </div>\n  \n<h2 id=\"aihc-daft内置算子示例\"><a href=\"#aihc-daft%E5%86%85%E7%BD%AE%E7%AE%97%E5%AD%90%E7%A4%BA%E4%BE%8B\" aria-label=\"aihc daft内置算子示例 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 
6z\"></path></svg></a>aihc-daft Built-in Operator Example</h2>\n<p>This section uses the <a href=\"https://cloud.baidu.com/doc/AIHC/s/Jmo9zq0kr\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">image hash operator</a> as an example.</p>\n<p>The test_image_hash.py script is as follows:</p>\n\n    <div class=\"code-block-wrapper\">\n        <div class=\"code-block\">\n            <div class=\"code-block-header\">\n                <span class=\"code-block-name\">Plain Text</span>\n                <button class=\"code-copy-btn\" data-tooltip-text=\"\">\n                    <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" fill=\"none\"> <path fill-rule=\"evenodd\" clip-rule=\"evenodd\" d=\"M5.57894 3.45614C5.57894 3.38832 5.63392 3.33333 5.70175 3.33333H12.5439C12.6117 3.33333 12.6667 3.38832 12.6667 3.45614V10.2982C12.6667 10.3661 12.6117 10.4211 12.5439 10.4211H11.7544V5.70175C11.7544 4.89754 11.1025 4.24561 10.2982 4.24561H5.57894V3.45614ZM4.24561 4.24561V3.45614C4.24561 2.65194 4.89754 2 5.70175 2H12.5439C13.3481 2 14 2.65194 14 3.45614V10.2982C14 11.1025 13.3481 11.7544 12.5439 11.7544H11.7544V12.5439C11.7544 13.3481 11.1025 14 10.2982 14H3.45614C2.65194 14 2 13.3481 2 12.5439V5.70175C2 4.89754 2.65194 4.24561 3.45614 4.24561H4.24561ZM3.33333 5.70175C3.33333 5.63392 3.38832 5.57894 3.45614 5.57894H10.2982C10.3661 5.57894 10.4211 5.63392 10.4211 5.70175V12.5439C10.4211 12.6117 10.3661 12.6667 10.2982 12.6667H3.45614C3.38832 12.6667 3.33333 12.6117 3.33333 12.5439V5.70175Z\" fill=\"currentColor\"></path> </svg>\n                    Copy\n                </button>\n            </div>\n            <div class=\"code-block-content\">\n                <pre class=\"language-text\"><code><span class=\"line-number\">1</span>from __future__ import annotations\n<span class=\"line-number\">2</span>\n<span class=\"line-number\">3</span>import os\n<span class=\"line-number\">4</span>import daft\n<span class=\"line-number\">5</span>from daft import col\n<span class=\"line-number\">6</span>\n<span 
class=\"line-number\">7</span>from daft.aihc.common.udf import aihc_udf\n<span class=\"line-number\">8</span>from daft.aihc.functions.image.image_hash import ImageHash\n<span class=\"line-number\">9</span>\n<span class=\"line-number\">10</span>if __name__ == &quot;__main__&quot;:\n<span class=\"line-number\">11</span>    if os.getenv(&quot;DAFT_RUNNER&quot;, &quot;native&quot;) == &quot;ray&quot;:\n<span class=\"line-number\">12</span>        import ray\n<span class=\"line-number\">13</span>        ray.init(dashboard_host=&quot;0.0.0.0&quot;, ignore_reinit_error=True)\n<span class=\"line-number\">14</span>        daft.set_runner_ray()\n<span class=\"line-number\">15</span>    daft.set_execution_config(actor_udf_ready_timeout=6000, min_cpu_per_task=0)\n<span class=\"line-number\">16</span>\n<span class=\"line-number\">17</span>    samples = {\n<span class=\"line-number\">18</span>        &quot;image&quot;: [\n<span class=\"line-number\">19</span>            &quot;file:///local/sample_1.jpg&quot;,\n<span class=\"line-number\">20</span>            &quot;file:///mnt/pfs/sample_2.jpg&quot;,\n<span class=\"line-number\">21</span>            &quot;file:///mnt/bos/sample_3.jpg&quot;,\n<span class=\"line-number\">22</span>        ]\n<span class=\"line-number\">23</span>    }\n<span class=\"line-number\">24</span>\n<span class=\"line-number\">25</span>    num_datasets = len(samples[&quot;image&quot;])\n<span class=\"line-number\">26</span>    ds = daft.from_pydict(samples).into_partitions(num_datasets)  # force a split into partitions for distributed execution\n<span class=\"line-number\">27</span>    ds = ds.with_column(\n<span class=\"line-number\">28</span>        &quot;image_hash&quot;,\n<span class=\"line-number\">29</span>        aihc_udf(\n<span class=\"line-number\">30</span>            ImageHash,\n<span class=\"line-number\">31</span>            construct_args={\n<span class=\"line-number\">32</span>                &quot;image_src_type&quot;: &quot;image_url&quot;,\n<span 
class=\"line-number\">33</span>                &quot;method&quot;: &quot;phash&quot;,\n<span class=\"line-number\">34</span>            },\n<span class=\"line-number\">35</span>            num_cpus=0.5,\n<span class=\"line-number\">36</span>            batch_size=1,\n<span class=\"line-number\">37</span>            concurrency=num_datasets,  # process multiple datasets concurrently\n<span class=\"line-number\">38</span>        )(col(&quot;image&quot;)),\n<span class=\"line-number\">39</span>    )\n<span class=\"line-number\">40</span>    ds.show()</code></pre>\n            </div>\n        </div>\n    </div>\n  \n<p>Commands for running the script, distributed or on a single machine:</p>\n\n    <div class=\"code-block-wrapper\">\n        <div class=\"code-block\">\n            <div class=\"code-block-header\">\n                <span class=\"code-block-name\">Plain Text</span>\n                <button class=\"code-copy-btn\" data-tooltip-text=\"\">\n                    <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" fill=\"none\"> <path fill-rule=\"evenodd\" clip-rule=\"evenodd\" d=\"M5.57894 3.45614C5.57894 3.38832 5.63392 3.33333 5.70175 3.33333H12.5439C12.6117 3.33333 12.6667 3.38832 12.6667 3.45614V10.2982C12.6667 10.3661 12.6117 10.4211 12.5439 10.4211H11.7544V5.70175C11.7544 4.89754 11.1025 4.24561 10.2982 4.24561H5.57894V3.45614ZM4.24561 4.24561V3.45614C4.24561 2.65194 4.89754 2 5.70175 2H12.5439C13.3481 2 14 2.65194 14 3.45614V10.2982C14 11.1025 13.3481 11.7544 12.5439 11.7544H11.7544V12.5439C11.7544 13.3481 11.1025 14 10.2982 14H3.45614C2.65194 14 2 13.3481 2 12.5439V5.70175C2 4.89754 2.65194 4.24561 3.45614 4.24561H4.24561ZM3.33333 5.70175C3.33333 5.63392 3.38832 5.57894 3.45614 5.57894H10.2982C10.3661 5.57894 10.4211 5.63392 10.4211 5.70175V12.5439C10.4211 12.6117 10.3661 12.6667 10.2982 12.6667H3.45614C3.38832 12.6667 3.33333 12.6117 3.33333 12.5439V5.70175Z\" fill=\"currentColor\"></path> </svg>\n                    Copy\n                </button>\n            </div>\n            <div 
class=\"code-block-content\">\n                <pre class=\"language-text\"><code><span class=\"line-number\">1</span># Run distributed on Ray by setting DAFT_RUNNER=ray\n<span class=\"line-number\">2</span>DAFT_RUNNER=ray python test_image_hash.py\n<span class=\"line-number\">3</span>\n<span class=\"line-number\">4</span># Run on a single machine\n<span class=\"line-number\">5</span>python test_image_hash.py</code></pre>\n            </div>\n        </div>\n    </div>\n  \n<h2 id=\"aihc-daft-基础参数说明\"><a href=\"#aihc-daft-%E5%9F%BA%E7%A1%80%E5%8F%82%E6%95%B0%E8%AF%B4%E6%98%8E\" aria-label=\"aihc daft 基础参数说明 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>aihc-daft Basic Parameters</h2>\n<h3 id=\"aihc_udf-参数说明\"><a href=\"#aihc_udf-%E5%8F%82%E6%95%B0%E8%AF%B4%E6%98%8E\" aria-label=\"aihc_udf 参数说明 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>aihc_udf 
Parameters</h3>\n<table>\n<thead>\n<tr>\n<th>Parameter</th>\n<th>Description</th>\n<th>Default</th>\n<th>Example</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><code>operator</code></td>\n<td>Operator class (required)</td>\n<td>N/A</td>\n<td><code>ImageHash</code></td>\n</tr>\n<tr>\n<td><code>construct_args</code></td>\n<td>Arguments passed to the operator's initializer</td>\n<td><code>{}</code></td>\n<td><code>{\"image_src_type\": \"image_url\", \"method\": \"phash\"}</code></td>\n</tr>\n<tr>\n<td><code>num_cpus</code></td>\n<td>CPU cores used by each instance</td>\n<td><code>None</code> (assigned automatically by the scheduler)</td>\n<td><code>2</code></td>\n</tr>\n<tr>\n<td><code>num_gpus</code></td>\n<td>GPUs used by each instance</td>\n<td><code>None</code> (no GPU)</td>\n<td><code>1</code> or <code>0.5</code></td>\n</tr>\n<tr>\n<td><code>memory_bytes</code></td>\n<td>Memory limit per instance, in bytes</td>\n<td><code>None</code> (unlimited)</td>\n<td><code>2 * 1024**3</code> (2 GiB)</td>\n</tr>\n<tr>\n<td><code>batch_size</code></td>\n<td>Number of records processed per call</td>\n<td><code>None</code> (decided by the framework)</td>\n<td><code>64</code></td>\n</tr>\n<tr>\n<td><code>concurrency</code></td>\n<td>Number of instances running at the same time</td>\n<td><code>None</code> (decided by the framework)</td>\n<td><code>8</code></td>\n</tr>\n<tr>\n<td><code>use_process</code></td>\n<td>Whether to use process isolation (recommended for CPU-intensive work)</td>\n<td><code>False</code> (uses threads)</td>\n<td><code>True</code></td>\n</tr>\n</tbody>\n</table>\n<h3 id=\"数据读写方式\"><a href=\"#%E6%95%B0%E6%8D%AE%E8%AF%BB%E5%86%99%E6%96%B9%E5%BC%8F\" aria-label=\"数据读写方式 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Data Read/Write Methods</h3>\n<p>Data can be read from local files, files in mounted directories, BOS, HTTP and other sources:</p>\n\n    
<div class=\"code-block-wrapper\">\n        <div class=\"code-block\">\n            <div class=\"code-block-header\">\n                <span class=\"code-block-name\">Python</span>\n                <button class=\"code-copy-btn\" data-tooltip-text=\"\">\n                    <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" fill=\"none\"> <path fill-rule=\"evenodd\" clip-rule=\"evenodd\" d=\"M5.57894 3.45614C5.57894 3.38832 5.63392 3.33333 5.70175 3.33333H12.5439C12.6117 3.33333 12.6667 3.38832 12.6667 3.45614V10.2982C12.6667 10.3661 12.6117 10.4211 12.5439 10.4211H11.7544V5.70175C11.7544 4.89754 11.1025 4.24561 10.2982 4.24561H5.57894V3.45614ZM4.24561 4.24561V3.45614C4.24561 2.65194 4.89754 2 5.70175 2H12.5439C13.3481 2 14 2.65194 14 3.45614V10.2982C14 11.1025 13.3481 11.7544 12.5439 11.7544H11.7544V12.5439C11.7544 13.3481 11.1025 14 10.2982 14H3.45614C2.65194 14 2 13.3481 2 12.5439V5.70175C2 4.89754 2.65194 4.24561 3.45614 4.24561H4.24561ZM3.33333 5.70175C3.33333 5.63392 3.38832 5.57894 3.45614 5.57894H10.2982C10.3661 5.57894 10.4211 5.63392 10.4211 5.70175V12.5439C10.4211 12.6117 10.3661 12.6667 10.2982 12.6667H3.45614C3.38832 12.6667 3.33333 12.6117 3.33333 12.5439V5.70175Z\" fill=\"currentColor\"></path> </svg>\n                    Copy\n                </button>\n            </div>\n            <div class=\"code-block-content\">\n                <pre class=\"language-python\"><code><span class=\"line-number\">1</span>    samples <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>\n<span class=\"line-number\">2</span>        <span class=\"token string\">\"image\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">[</span>\n<span class=\"line-number\">3</span>            <span class=\"token string\">\"file:///local/sample_1.jpg\"</span><span class=\"token punctuation\">,</span>      <span class=\"token comment\"># local file</span>\n<span class=\"line-number\">4</span> 
           <span class=\"token string\">\"file:///mnt/pfs/sample_2.jpg\"</span><span class=\"token punctuation\">,</span>    <span class=\"token comment\"># file under a PFS mount point</span>\n<span class=\"line-number\">5</span>            <span class=\"token string\">\"file:///mnt/bos/sample_3.jpg\"</span><span class=\"token punctuation\">,</span>    <span class=\"token comment\"># file under a BOS mount point</span>\n<span class=\"line-number\">6</span>            <span class=\"token string\">\"bos://bucket/path/sample_4.jpg\"</span><span class=\"token punctuation\">,</span>  <span class=\"token comment\"># fetched directly from BOS</span>\n<span class=\"line-number\">7</span>            <span class=\"token string\">\"http://url/sample_5.jpg\"</span><span class=\"token punctuation\">,</span>         <span class=\"token comment\"># fetched over HTTP</span>\n<span class=\"line-number\">8</span>        <span class=\"token punctuation\">]</span>\n<span class=\"line-number\">9</span>    <span class=\"token punctuation\">}</span></code></pre>\n            </div>\n        </div>\n    </div>\n  \n<p>To fetch data directly from BOS, set the BOS environment variables in your data processing code, as follows:</p>\n\n    <div class=\"code-block-wrapper\">\n        <div class=\"code-block\">\n            <div class=\"code-block-header\">\n                <span class=\"code-block-name\">Python</span>\n                <button class=\"code-copy-btn\" data-tooltip-text=\"\">\n                    <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" fill=\"none\"> <path fill-rule=\"evenodd\" clip-rule=\"evenodd\" d=\"M5.57894 3.45614C5.57894 3.38832 5.63392 3.33333 5.70175 3.33333H12.5439C12.6117 3.33333 12.6667 3.38832 12.6667 3.45614V10.2982C12.6667 10.3661 12.6117 10.4211 12.5439 10.4211H11.7544V5.70175C11.7544 4.89754 11.1025 4.24561 10.2982 4.24561H5.57894V3.45614ZM4.24561 4.24561V3.45614C4.24561 2.65194 4.89754 2 5.70175 2H12.5439C13.3481 2 14 2.65194 14 3.45614V10.2982C14 11.1025 13.3481 11.7544 12.5439 11.7544H11.7544V12.5439C11.7544 13.3481 11.1025 14 10.2982 14H3.45614C2.65194 14 2 13.3481 2 12.5439V5.70175C2 4.89754 2.65194 4.24561 3.45614 
4.24561H4.24561ZM3.33333 5.70175C3.33333 5.63392 3.38832 5.57894 3.45614 5.57894H10.2982C10.3661 5.57894 10.4211 5.63392 10.4211 5.70175V12.5439C10.4211 12.6117 10.3661 12.6667 10.2982 12.6667H3.45614C3.38832 12.6667 3.33333 12.6117 3.33333 12.5439V5.70175Z\" fill=\"currentColor\"></path> </svg>\n                    Copy\n                </button>\n            </div>\n            <div class=\"code-block-content\">\n                <pre class=\"language-python\"><code><span class=\"line-number\">1</span><span class=\"token keyword\">import</span> os\n<span class=\"line-number\">2</span>\n<span class=\"line-number\">3</span>os<span class=\"token punctuation\">.</span>environ<span class=\"token punctuation\">[</span><span class=\"token string\">\"BOS_ENDPOINT\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> <span class=\"token string\">\"http://bj.bcebos.com\"</span>  <span class=\"token comment\"># BOS endpoint</span>\n<span class=\"line-number\">4</span>os<span class=\"token punctuation\">.</span>environ<span class=\"token punctuation\">[</span><span class=\"token string\">\"BOS_ACCESS_KEY_ID\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> <span class=\"token string\">\"\"</span>  <span class=\"token comment\"># fill in your access key ID</span>\n<span class=\"line-number\">5</span>os<span class=\"token punctuation\">.</span>environ<span class=\"token punctuation\">[</span><span class=\"token string\">\"BOS_SECRET_ACCESS_KEY\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> <span class=\"token string\">\"\"</span>  <span class=\"token comment\"># fill in your secret access key</span>\n<span class=\"line-number\">6</span>os<span class=\"token punctuation\">.</span>environ<span class=\"token punctuation\">[</span><span class=\"token string\">\"BOS_REGION\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> <span class=\"token string\">\"bj\"</span>  <span class=\"token comment\"># BOS region</span></code></pre>\n            </div>\n        </div>\n    </div>\n  \n<h2 id=\"最佳实践\"><a href=\"#%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5\" aria-label=\"最佳实践 permalink\" class=\"anchor\"><svg 
aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Best Practices</h2>\n<p>This section uses embodied-AI data format conversion as an example. It shows how to use the Baige platform's data processing operators, from a development machine or a distributed training job, to convert a LeRobot v2.1 dataset to the v3.0 format.</p>\n<h3 id=\"环境准备\"><a href=\"#%E7%8E%AF%E5%A2%83%E5%87%86%E5%A4%87\" aria-label=\"环境准备 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Environment Setup</h3>\n<ol>\n<li>You can use a development machine to develop and debug the code. To integrate aihc-daft, either start the development machine from the aihc-daft image, or start it from your own image and install the aihc-daft package manually.</li>\n<li>\n<p>Data preparation</p>\n<ul>\n<li>This example uses the open-source test datasets lerobot/pusht/, dataset/lerobot/pusht2/ and lerobot/aloha_sim_insertion_human/ from Hugging Face.</li>\n<li>\n<p>A packaged copy of the datasets is available: <a href=\"http://bj.bcebos.com/test-bj-h1/test_dataset.tar.gz\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Download</a>. The original datasets are laid out as follows:</p>\n<ul>\n<li>dataset/lerobot/aloha_sim_insertion_human/</li>\n<li>dataset/lerobot/pusht/</li>\n<li>dataset/lerobot/pusht2/</li>\n</ul>\n</li>\n</ul>\n</li>\n</ol>\n<h3 id=\"算子开发\"><a href=\"#%E7%AE%97%E5%AD%90%E5%BC%80%E5%8F%91\" aria-label=\"算子开发 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" 
width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Operator Development</h3>\n<ol>\n<li>Building on the original <a href=\"https://cloud.baidu.com/doc/AIHC/s/3moa0aug1\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">TarUncompress</a> operator, we implement capabilities such as recursively extracting the tar files found under a directory. The script is as follows:</li>\n</ol>\n\n    <div class=\"code-block-wrapper\">\n        <div class=\"code-block\">\n            <div class=\"code-block-header\">\n                <span class=\"code-block-name\">Python</span>\n                <button class=\"code-copy-btn\" data-tooltip-text=\"\">\n                    <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" fill=\"none\"> <path fill-rule=\"evenodd\" clip-rule=\"evenodd\" d=\"M5.57894 3.45614C5.57894 3.38832 5.63392 3.33333 5.70175 3.33333H12.5439C12.6117 3.33333 12.6667 3.38832 12.6667 3.45614V10.2982C12.6667 10.3661 12.6117 10.4211 12.5439 10.4211H11.7544V5.70175C11.7544 4.89754 11.1025 4.24561 10.2982 4.24561H5.57894V3.45614ZM4.24561 4.24561V3.45614C4.24561 2.65194 4.89754 2 5.70175 2H12.5439C13.3481 2 14 2.65194 14 3.45614V10.2982C14 11.1025 13.3481 11.7544 12.5439 11.7544H11.7544V12.5439C11.7544 13.3481 11.1025 14 10.2982 14H3.45614C2.65194 14 2 13.3481 2 12.5439V5.70175C2 4.89754 2.65194 4.24561 3.45614 4.24561H4.24561ZM3.33333 5.70175C3.33333 5.63392 3.38832 5.57894 3.45614 5.57894H10.2982C10.3661 5.57894 10.4211 5.63392 10.4211 5.70175V12.5439C10.4211 12.6117 10.3661 12.6667 10.2982 12.6667H3.45614C3.38832 12.6667 3.33333 12.6117 3.33333 12.5439V5.70175Z\" fill=\"currentColor\"></path> </svg>\n                    Copy\n                </button>\n            </div>\n            <div 
class=\"code-block-content\">\n                <pre class=\"language-python\"><code><span class=\"line-number\">1</span><span class=\"token keyword\">import</span> json\n<span class=\"line-number\">2</span><span class=\"token keyword\">import</span> os\n<span class=\"line-number\">3</span><span class=\"token keyword\">import</span> tarfile\n<span class=\"line-number\">4</span><span class=\"token keyword\">import</span> daft\n<span class=\"line-number\">5</span><span class=\"token keyword\">from</span> daft <span class=\"token keyword\">import</span> col\n<span class=\"line-number\">6</span>\n<span class=\"line-number\">7</span><span class=\"token keyword\">from</span> daft<span class=\"token punctuation\">.</span>aihc<span class=\"token punctuation\">.</span>common<span class=\"token punctuation\">.</span>udf <span class=\"token keyword\">import</span> aihc_udf\n<span class=\"line-number\">8</span><span class=\"token keyword\">from</span> daft<span class=\"token punctuation\">.</span>aihc<span class=\"token punctuation\">.</span>functions<span class=\"token punctuation\">.</span>process<span class=\"token punctuation\">.</span>tar_extractor_udf <span class=\"token keyword\">import</span> TarUncompress\n<span class=\"line-number\">9</span><span class=\"token keyword\">from</span> daft<span class=\"token punctuation\">.</span>aihc<span class=\"token punctuation\">.</span>functions<span class=\"token punctuation\">.</span>process<span class=\"token punctuation\">.</span>tar_extractor_udf <span class=\"token keyword\">import</span> discover_datasets\n<span class=\"line-number\">10</span><span class=\"token keyword\">from</span> daft<span class=\"token punctuation\">.</span>aihc<span class=\"token punctuation\">.</span>functions<span class=\"token punctuation\">.</span>process<span class=\"token punctuation\">.</span>tar_extractor_udf <span class=\"token keyword\">import</span> create_tasks_from_datasets\n<span class=\"line-number\">11</span>\n<span 
class=\"line-number\">12</span>TAR_EXTENSIONS <span class=\"token operator\">=</span> <span class=\"token punctuation\">(</span><span class=\"token string\">\".tar\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tar.gz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tgz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tar.bz2\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tbz2\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tar.xz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".txz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tar.zst\"</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">13</span>\n<span class=\"line-number\">14</span>\n<span class=\"line-number\">15</span><span class=\"token keyword\">def</span> <span class=\"token function\">is_tar_file</span><span class=\"token punctuation\">(</span>filepath<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">bool</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">16</span>    lower <span class=\"token operator\">=</span> filepath<span class=\"token punctuation\">.</span>lower<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">17</span>    <span class=\"token keyword\">if</span> <span class=\"token keyword\">not</span> <span class=\"token builtin\">any</span><span class=\"token punctuation\">(</span>lower<span class=\"token punctuation\">.</span>endswith<span class=\"token punctuation\">(</span>ext<span class=\"token punctuation\">)</span> <span class=\"token 
keyword\">for</span> ext <span class=\"token keyword\">in</span> TAR_EXTENSIONS<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">18</span>        <span class=\"token keyword\">return</span> <span class=\"token boolean\">False</span>\n<span class=\"line-number\">19</span>    <span class=\"token keyword\">return</span> tarfile<span class=\"token punctuation\">.</span>is_tarfile<span class=\"token punctuation\">(</span>filepath<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">20</span>\n<span class=\"line-number\">21</span>\n<span class=\"line-number\">22</span><span class=\"token keyword\">def</span> <span class=\"token function\">find_tar_files</span><span class=\"token punctuation\">(</span>directory<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">list</span><span class=\"token punctuation\">[</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">23</span>    tar_files <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n<span class=\"line-number\">24</span>    <span class=\"token keyword\">for</span> root<span class=\"token punctuation\">,</span> _dirs<span class=\"token punctuation\">,</span> files <span class=\"token keyword\">in</span> os<span class=\"token punctuation\">.</span>walk<span class=\"token punctuation\">(</span>directory<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">25</span>        <span class=\"token keyword\">for</span> f <span class=\"token keyword\">in</span> files<span class=\"token punctuation\">:</span>\n<span 
class=\"line-number\">26</span>            full <span class=\"token operator\">=</span> os<span class=\"token punctuation\">.</span>path<span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span>root<span class=\"token punctuation\">,</span> f<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">27</span>            <span class=\"token keyword\">if</span> is_tar_file<span class=\"token punctuation\">(</span>full<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">28</span>                tar_files<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>full<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">29</span>    <span class=\"token keyword\">return</span> tar_files\n<span class=\"line-number\">30</span>\n<span class=\"line-number\">31</span>\n<span class=\"line-number\">32</span><span class=\"token keyword\">def</span> <span class=\"token function\">safe_members</span><span class=\"token punctuation\">(</span>tf<span class=\"token punctuation\">:</span> tarfile<span class=\"token punctuation\">.</span>TarFile<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">list</span><span class=\"token punctuation\">[</span>tarfile<span class=\"token punctuation\">.</span>TarInfo<span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">33</span>    <span class=\"token keyword\">return</span> <span class=\"token punctuation\">[</span>m <span class=\"token keyword\">for</span> m <span class=\"token keyword\">in</span> tf<span class=\"token punctuation\">.</span>getmembers<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">if</span> <span class=\"token keyword\">not</span> m<span class=\"token 
punctuation\">.</span>name<span class=\"token punctuation\">.</span>startswith<span class=\"token punctuation\">(</span><span class=\"token string\">\"/\"</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">and</span> <span class=\"token string\">\"..\"</span> <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> m<span class=\"token punctuation\">.</span>name<span class=\"token punctuation\">]</span>\n<span class=\"line-number\">34</span>\n<span class=\"line-number\">35</span>\n<span class=\"line-number\">36</span><span class=\"token keyword\">def</span> <span class=\"token function\">extract_recursive</span><span class=\"token punctuation\">(</span>tar_path<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">,</span> output_dir<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">list</span><span class=\"token punctuation\">[</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">37</span>    all_extracted <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n<span class=\"line-number\">38</span>\n<span class=\"line-number\">39</span>    <span class=\"token comment\"># First pass: extract the outer archive into output_dir</span>\n<span class=\"line-number\">40</span>    <span class=\"token keyword\">with</span> tarfile<span class=\"token punctuation\">.</span><span class=\"token builtin\">open</span><span class=\"token punctuation\">(</span>tar_path<span class=\"token punctuation\">)</span> <span class=\"token keyword\">as</span> tf<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">41</span>        members <span 
class=\"token operator\">=</span> safe_members<span class=\"token punctuation\">(</span>tf<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">42</span>        tf<span class=\"token punctuation\">.</span>extractall<span class=\"token punctuation\">(</span>path<span class=\"token operator\">=</span>output_dir<span class=\"token punctuation\">,</span> members<span class=\"token operator\">=</span>members<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">43</span>        all_extracted<span class=\"token punctuation\">.</span>extend<span class=\"token punctuation\">(</span><span class=\"token punctuation\">[</span>os<span class=\"token punctuation\">.</span>path<span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span>output_dir<span class=\"token punctuation\">,</span> m<span class=\"token punctuation\">.</span>name<span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> m <span class=\"token keyword\">in</span> members<span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">44</span>\n<span class=\"line-number\">45</span>    <span class=\"token comment\"># Keep scanning output_dir and extracting any tar files that newly appear</span>\n<span class=\"line-number\">46</span>    pending <span class=\"token operator\">=</span> find_tar_files<span class=\"token punctuation\">(</span>output_dir<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">47</span>    processed <span class=\"token operator\">=</span> <span class=\"token builtin\">set</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">48</span>\n<span class=\"line-number\">49</span>    <span class=\"token keyword\">while</span> pending<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">50</span>        current <span class=\"token operator\">=</span> pending<span class=\"token punctuation\">.</span>pop<span 
class=\"token punctuation\">(</span><span class=\"token number\">0</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">51</span>        <span class=\"token keyword\">if</span> current <span class=\"token keyword\">in</span> processed<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">52</span>            <span class=\"token keyword\">continue</span>\n<span class=\"line-number\">53</span>        processed<span class=\"token punctuation\">.</span>add<span class=\"token punctuation\">(</span>current<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">54</span>\n<span class=\"line-number\">55</span>        extract_dir <span class=\"token operator\">=</span> os<span class=\"token punctuation\">.</span>path<span class=\"token punctuation\">.</span>dirname<span class=\"token punctuation\">(</span>current<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">56</span>        <span class=\"token keyword\">with</span> tarfile<span class=\"token punctuation\">.</span><span class=\"token builtin\">open</span><span class=\"token punctuation\">(</span>current<span class=\"token punctuation\">)</span> <span class=\"token keyword\">as</span> tf<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">57</span>            members <span class=\"token operator\">=</span> safe_members<span class=\"token punctuation\">(</span>tf<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">58</span>            tf<span class=\"token punctuation\">.</span>extractall<span class=\"token punctuation\">(</span>path<span class=\"token operator\">=</span>extract_dir<span class=\"token punctuation\">,</span> members<span class=\"token operator\">=</span>members<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">59</span>            all_extracted<span class=\"token punctuation\">.</span>extend<span class=\"token punctuation\">(</span><span class=\"token 
punctuation\">[</span>os<span class=\"token punctuation\">.</span>path<span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span>extract_dir<span class=\"token punctuation\">,</span> m<span class=\"token punctuation\">.</span>name<span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> m <span class=\"token keyword\">in</span> members<span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">60</span>\n<span class=\"line-number\">61</span>        <span class=\"token comment\"># Inner tar archives are kept by default; uncomment the next line to delete each one after it is extracted</span>\n<span class=\"line-number\">62</span>        <span class=\"token comment\"># os.remove(current)</span>\n<span class=\"line-number\">63</span>\n<span class=\"line-number\">64</span>        new_tars <span class=\"token operator\">=</span> find_tar_files<span class=\"token punctuation\">(</span>output_dir<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">65</span>        <span class=\"token keyword\">for</span> t <span class=\"token keyword\">in</span> new_tars<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">66</span>            <span class=\"token keyword\">if</span> t <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> processed <span class=\"token keyword\">and</span> t <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> pending<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">67</span>                pending<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>t<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">68</span>\n<span class=\"line-number\">69</span>    <span class=\"token keyword\">return</span> all_extracted\n<span class=\"line-number\">70</span>\n<span class=\"line-number\">71</span>\n<span class=\"line-number\">72</span><span class=\"token 
keyword\">class</span> <span class=\"token class-name\">RecursiveTarUncompress</span><span class=\"token punctuation\">(</span>TarUncompress<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">73</span>    <span class=\"token triple-quoted-string string\">\"\"\"UDF that recursively extracts multi-level nested tar archives.\"\"\"</span>\n<span class=\"line-number\">74</span>\n<span class=\"line-number\">75</span>    <span class=\"token keyword\">def</span> <span class=\"token function\">__call__</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> input_path<span class=\"token punctuation\">,</span> output_path<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">76</span>        input_list <span class=\"token operator\">=</span> input_path<span class=\"token punctuation\">.</span>to_pylist<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">77</span>        output_list <span class=\"token operator\">=</span> output_path<span class=\"token punctuation\">.</span>to_pylist<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">78</span>\n<span class=\"line-number\">79</span>        results <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n<span class=\"line-number\">80</span>        <span class=\"token keyword\">for</span> inp<span class=\"token punctuation\">,</span> outp <span class=\"token keyword\">in</span> <span class=\"token builtin\">zip</span><span class=\"token punctuation\">(</span>input_list<span class=\"token punctuation\">,</span> output_list<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">81</span>            os<span class=\"token punctuation\">.</span>makedirs<span class=\"token 
punctuation\">(</span>outp<span class=\"token punctuation\">,</span> exist_ok<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">82</span>            all_files <span class=\"token operator\">=</span> extract_recursive<span class=\"token punctuation\">(</span>inp<span class=\"token punctuation\">,</span> outp<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">83</span>            <span class=\"token comment\"># Return a JSON string, matching the __return_column_type__ (String) of the parent class TarUncompress</span>\n<span class=\"line-number\">84</span>            results<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>json<span class=\"token punctuation\">.</span>dumps<span class=\"token punctuation\">(</span><span class=\"token punctuation\">{</span>\n<span class=\"line-number\">85</span>                <span class=\"token string\">\"status\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"success\"</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">86</span>                <span class=\"token string\">\"input\"</span><span class=\"token punctuation\">:</span> inp<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">87</span>                <span class=\"token string\">\"output\"</span><span class=\"token punctuation\">:</span> outp<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">88</span>                <span class=\"token string\">\"extracted_files\"</span><span class=\"token punctuation\">:</span> all_files<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">89</span>                <span class=\"token string\">\"extracted_count\"</span><span class=\"token punctuation\">:</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>all_files<span class=\"token punctuation\">)</span><span 
class=\"token punctuation\">,</span>\n<span class=\"line-number\">90</span>            <span class=\"token punctuation\">}</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">91</span>        <span class=\"token keyword\">return</span> results\n<span class=\"line-number\">92</span>\n<span class=\"line-number\">93</span>\n<span class=\"line-number\">94</span><span class=\"token keyword\">if</span> __name__ <span class=\"token operator\">==</span> <span class=\"token string\">\"__main__\"</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">95</span>    <span class=\"token keyword\">if</span> os<span class=\"token punctuation\">.</span>getenv<span class=\"token punctuation\">(</span><span class=\"token string\">\"DAFT_RUNNER\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"native\"</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">==</span> <span class=\"token string\">\"ray\"</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">96</span>        <span class=\"token keyword\">import</span> ray\n<span class=\"line-number\">97</span>        ray<span class=\"token punctuation\">.</span>init<span class=\"token punctuation\">(</span>dashboard_host<span class=\"token operator\">=</span><span class=\"token string\">\"0.0.0.0\"</span><span class=\"token punctuation\">,</span> ignore_reinit_error<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">98</span>        daft<span class=\"token punctuation\">.</span>set_runner_ray<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">99</span>    daft<span class=\"token punctuation\">.</span>set_execution_config<span class=\"token punctuation\">(</span>actor_udf_ready_timeout<span class=\"token 
operator\">=</span><span class=\"token number\">6000</span><span class=\"token punctuation\">,</span> min_cpu_per_task<span class=\"token operator\">=</span><span class=\"token number\">0</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">100</span>\n<span class=\"line-number\">101</span>    base_path <span class=\"token operator\">=</span> <span class=\"token string\">\"/mnt/pfs/xx\"</span>  <span class=\"token comment\"># [Replace with your own] directory that actually holds the tar archives</span>\n<span class=\"line-number\">102</span>\n<span class=\"line-number\">103</span>    <span class=\"token comment\"># Scan base_path directly for tar files</span>\n<span class=\"line-number\">104</span>    tar_files <span class=\"token operator\">=</span> find_tar_files<span class=\"token punctuation\">(</span>base_path<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">105</span>    <span class=\"token keyword\">if</span> <span class=\"token keyword\">not</span> tar_files<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">106</span>        <span class=\"token keyword\">raise</span> ValueError<span class=\"token punctuation\">(</span><span class=\"token string-interpolation\"><span class=\"token string\">f\"No tar files found under </span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>base_path<span class=\"token punctuation\">}</span></span><span class=\"token string\">; please check the path\"</span></span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">107</span>\n<span class=\"line-number\">108</span>    <span class=\"token comment\"># Strip every tar suffix to get the output directory, e.g. test_dataset.tar.gz -> test_dataset</span>\n<span class=\"line-number\">109</span>    <span class=\"token keyword\">def</span> <span class=\"token function\">strip_tar_ext</span><span class=\"token punctuation\">(</span>path<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token 
operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">110</span>        <span class=\"token keyword\">while</span> <span class=\"token boolean\">True</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">111</span>            base<span class=\"token punctuation\">,</span> ext <span class=\"token operator\">=</span> os<span class=\"token punctuation\">.</span>path<span class=\"token punctuation\">.</span>splitext<span class=\"token punctuation\">(</span>path<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">112</span>            <span class=\"token keyword\">if</span> ext<span class=\"token punctuation\">.</span>lower<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">in</span> <span class=\"token punctuation\">(</span><span class=\"token string\">\".gz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".bz2\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".xz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".zst\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tar\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tgz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tbz2\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".txz\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">113</span>                path <span class=\"token operator\">=</span> base\n<span class=\"line-number\">114</span>            <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">115</span>                <span 
class=\"token keyword\">break</span>\n<span class=\"line-number\">116</span>        <span class=\"token keyword\">return</span> path\n<span class=\"line-number\">117</span>\n<span class=\"line-number\">118</span>    tasks <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>\n<span class=\"line-number\">119</span>        <span class=\"token string\">\"input_path\"</span><span class=\"token punctuation\">:</span> tar_files<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">120</span>        <span class=\"token string\">\"output_path\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">[</span>strip_tar_ext<span class=\"token punctuation\">(</span>t<span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> t <span class=\"token keyword\">in</span> tar_files<span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">121</span>    <span class=\"token punctuation\">}</span>\n<span class=\"line-number\">122</span>\n<span class=\"line-number\">123</span>    num_tasks <span class=\"token operator\">=</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>tasks<span class=\"token punctuation\">[</span><span class=\"token string\">\"input_path\"</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">124</span>    concurrency <span class=\"token operator\">=</span> <span class=\"token builtin\">max</span><span class=\"token punctuation\">(</span>num_tasks<span class=\"token punctuation\">,</span> <span class=\"token number\">1</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">125</span>\n<span class=\"line-number\">126</span>    ds <span class=\"token operator\">=</span> daft<span class=\"token punctuation\">.</span>from_pydict<span class=\"token punctuation\">(</span>tasks<span class=\"token 
punctuation\">)</span>\n<span class=\"line-number\">127</span>    ds <span class=\"token operator\">=</span> ds<span class=\"token punctuation\">.</span>into_partitions<span class=\"token punctuation\">(</span>num_tasks<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">128</span>\n<span class=\"line-number\">129</span>    ds <span class=\"token operator\">=</span> ds<span class=\"token punctuation\">.</span>with_column<span class=\"token punctuation\">(</span>\n<span class=\"line-number\">130</span>        <span class=\"token string\">\"result\"</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">131</span>        aihc_udf<span class=\"token punctuation\">(</span>\n<span class=\"line-number\">132</span>            RecursiveTarUncompress<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">133</span>            construct_args<span class=\"token operator\">=</span><span class=\"token punctuation\">{</span><span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">134</span>            num_cpus<span class=\"token operator\">=</span><span class=\"token number\">1</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">135</span>            num_gpus<span class=\"token operator\">=</span><span class=\"token number\">0</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">136</span>            batch_size<span class=\"token operator\">=</span><span class=\"token number\">1</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">137</span>            concurrency<span class=\"token operator\">=</span>concurrency<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">138</span>            use_process<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">139</span>        <span 
class=\"token punctuation\">)</span><span class=\"token punctuation\">(</span>col<span class=\"token punctuation\">(</span><span class=\"token string\">\"input_path\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> col<span class=\"token punctuation\">(</span><span class=\"token string\">\"output_path\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">140</span>    <span class=\"token punctuation\">)</span>\n<span class=\"line-number\">141</span>    ds<span class=\"token punctuation\">.</span>show<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span></code></pre>\n            </div>\n        </div>\n    </div>\n  \n<ol start=\"2\">\n<li>Use the <a href=\"https://cloud.baidu.com/doc/AIHC/s/Wmoa08fe4\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">ConvertDatasetV21ToV30</a> operator to convert datasets from the LeRobot v2.1 format to the LeRobot v3.0 format. The script is as follows:</li>\n</ol>\n\n    <div class=\"code-block-wrapper\">\n        <div class=\"code-block\">\n            <div class=\"code-block-header\">\n                <span class=\"code-block-name\">Python</span>\n                <button class=\"code-copy-btn\" data-tooltip-text=\"\">\n                    <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" fill=\"none\"> <path fill-rule=\"evenodd\" clip-rule=\"evenodd\" d=\"M5.57894 3.45614C5.57894 3.38832 5.63392 3.33333 5.70175 3.33333H12.5439C12.6117 3.33333 12.6667 3.38832 12.6667 3.45614V10.2982C12.6667 10.3661 12.6117 10.4211 12.5439 10.4211H11.7544V5.70175C11.7544 4.89754 11.1025 4.24561 10.2982 4.24561H5.57894V3.45614ZM4.24561 4.24561V3.45614C4.24561 2.65194 4.89754 2 5.70175 2H12.5439C13.3481 2 14 2.65194 14 3.45614V10.2982C14 11.1025 13.3481 11.7544 12.5439 11.7544H11.7544V12.5439C11.7544 13.3481 11.1025 14 10.2982 14H3.45614C2.65194 14 2 13.3481 2 12.5439V5.70175C2 4.89754 
2.65194 4.24561 3.45614 4.24561H4.24561ZM3.33333 5.70175C3.33333 5.63392 3.38832 5.57894 3.45614 5.57894H10.2982C10.3661 5.57894 10.4211 5.63392 10.4211 5.70175V12.5439C10.4211 12.6117 10.3661 12.6667 10.2982 12.6667H3.45614C3.38832 12.6667 3.33333 12.6117 3.33333 12.5439V5.70175Z\" fill=\"currentColor\"></path> </svg>\n                    复制\n                </button>\n            </div>\n            <div class=\"code-block-content\">\n                <pre class=\"language-python\"><code><span class=\"line-number\">1</span><span class=\"token keyword\">import</span> os\n<span class=\"line-number\">2</span><span class=\"token keyword\">import</span> daft\n<span class=\"line-number\">3</span><span class=\"token keyword\">from</span> daft <span class=\"token keyword\">import</span> col\n<span class=\"line-number\">4</span>\n<span class=\"line-number\">5</span><span class=\"token keyword\">from</span> daft<span class=\"token punctuation\">.</span>aihc<span class=\"token punctuation\">.</span>common<span class=\"token punctuation\">.</span>udf <span class=\"token keyword\">import</span> aihc_udf\n<span class=\"line-number\">6</span><span class=\"token keyword\">from</span> daft<span class=\"token punctuation\">.</span>aihc<span class=\"token punctuation\">.</span>functions<span class=\"token punctuation\">.</span>embodied<span class=\"token punctuation\">.</span>convert_dataset_v21_to_v30_udf <span class=\"token keyword\">import</span> ConvertDatasetV21ToV30\n<span class=\"line-number\">7</span>\n<span class=\"line-number\">8</span><span class=\"token keyword\">if</span> __name__ <span class=\"token operator\">==</span> <span class=\"token string\">\"__main__\"</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">9</span>    <span class=\"token keyword\">if</span> os<span class=\"token punctuation\">.</span>getenv<span class=\"token punctuation\">(</span><span class=\"token string\">\"DAFT_RUNNER\"</span><span class=\"token 
punctuation\">,</span> <span class=\"token string\">\"native\"</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">==</span> <span class=\"token string\">\"ray\"</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">10</span>        <span class=\"token keyword\">import</span> ray\n<span class=\"line-number\">11</span>        ray<span class=\"token punctuation\">.</span>init<span class=\"token punctuation\">(</span>dashboard_host<span class=\"token operator\">=</span><span class=\"token string\">\"0.0.0.0\"</span><span class=\"token punctuation\">,</span> ignore_reinit_error<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">12</span>        daft<span class=\"token punctuation\">.</span>set_runner_ray<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">13</span>    daft<span class=\"token punctuation\">.</span>set_execution_config<span class=\"token punctuation\">(</span>actor_udf_ready_timeout<span class=\"token operator\">=</span><span class=\"token number\">6000</span><span class=\"token punctuation\">,</span> min_cpu_per_task<span class=\"token operator\">=</span><span class=\"token number\">0</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">14</span>\n<span class=\"line-number\">15</span>    tasks <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>\n<span class=\"line-number\">16</span>        <span class=\"token string\">\"input_repoid\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">[</span>\n<span class=\"line-number\">17</span>            <span class=\"token string\">\"lerobot/aloha_sim_insertion_human/\"</span><span class=\"token punctuation\">,</span> \n<span class=\"line-number\">18</span>            <span class=\"token 
string\">\"lerobot/pusht/\"</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">19</span>            <span class=\"token string\">\"lerobot/pusht2/\"</span>\n<span class=\"line-number\">20</span>        <span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">21</span>        <span class=\"token string\">\"input_path\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">[</span><span class=\"token string\">\"/mnt/pfs/xx/test_dataset/dataset/\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">*</span> <span class=\"token number\">3</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">22</span>        <span class=\"token string\">\"output_path\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">[</span><span class=\"token string\">\"/mnt/pfs/xx/lerobotv3\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">*</span> <span class=\"token number\">3</span>  <span class=\"token comment\"># [Replace with your own] output directory for the converted datasets</span>\n<span class=\"line-number\">23</span>    <span class=\"token punctuation\">}</span>\n<span class=\"line-number\">24</span>    num_datasets <span class=\"token operator\">=</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>tasks<span class=\"token punctuation\">[</span><span class=\"token string\">\"input_repoid\"</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">25</span>    ds <span class=\"token operator\">=</span> daft<span class=\"token punctuation\">.</span>from_pydict<span class=\"token punctuation\">(</span>tasks<span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>into_partitions<span class=\"token punctuation\">(</span>num_datasets<span class=\"token punctuation\">)</span>\n<span 
class=\"line-number\">26</span>\n<span class=\"line-number\">27</span>    ds <span class=\"token operator\">=</span> ds<span class=\"token punctuation\">.</span>with_column<span class=\"token punctuation\">(</span>\n<span class=\"line-number\">28</span>        <span class=\"token string\">\"convert_result\"</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">29</span>        aihc_udf<span class=\"token punctuation\">(</span>\n<span class=\"line-number\">30</span>            ConvertDatasetV21ToV30<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">31</span>            construct_args<span class=\"token operator\">=</span><span class=\"token punctuation\">{</span>\n<span class=\"line-number\">32</span>            <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">33</span>            num_cpus<span class=\"token operator\">=</span><span class=\"token number\">0.1</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">34</span>            batch_size<span class=\"token operator\">=</span><span class=\"token number\">1</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">35</span>            concurrency<span class=\"token operator\">=</span>num_datasets<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">36</span>            use_process<span class=\"token operator\">=</span><span class=\"token boolean\">True</span>\n<span class=\"line-number\">37</span>        <span class=\"token punctuation\">)</span><span class=\"token punctuation\">(</span>col<span class=\"token punctuation\">(</span><span class=\"token string\">\"input_repoid\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> col<span class=\"token punctuation\">(</span><span class=\"token string\">\"input_path\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> 
col<span class=\"token punctuation\">(</span><span class=\"token string\">\"output_path\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">38</span>    <span class=\"token punctuation\">)</span>\n<span class=\"line-number\">39</span>    ds<span class=\"token punctuation\">.</span>show<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span></code></pre>\n            </div>\n        </div>\n    </div>\n  \n<p>End-to-end workflow:</p>\n<p>Run <code>pipeline.py</code>, which chains the two steps (recursive tar extraction, then v21 to v30 conversion):</p>\n\n    <div class=\"code-block-wrapper\">\n        <div class=\"code-block\">\n            <div class=\"code-block-header\">\n                <span class=\"code-block-name\">Python</span>\n                <button class=\"code-copy-btn\" data-tooltip-text=\"\">\n                    <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" fill=\"none\"> <path fill-rule=\"evenodd\" clip-rule=\"evenodd\" d=\"M5.57894 3.45614C5.57894 3.38832 5.63392 3.33333 5.70175 3.33333H12.5439C12.6117 3.33333 12.6667 3.38832 12.6667 3.45614V10.2982C12.6667 10.3661 12.6117 10.4211 12.5439 10.4211H11.7544V5.70175C11.7544 4.89754 11.1025 4.24561 10.2982 4.24561H5.57894V3.45614ZM4.24561 4.24561V3.45614C4.24561 2.65194 4.89754 2 5.70175 2H12.5439C13.3481 2 14 2.65194 14 3.45614V10.2982C14 11.1025 13.3481 11.7544 12.5439 11.7544H11.7544V12.5439C11.7544 13.3481 11.1025 14 10.2982 14H3.45614C2.65194 14 2 13.3481 2 12.5439V5.70175C2 4.89754 2.65194 4.24561 3.45614 4.24561H4.24561ZM3.33333 5.70175C3.33333 5.63392 3.38832 5.57894 3.45614 5.57894H10.2982C10.3661 5.57894 10.4211 5.63392 10.4211 5.70175V12.5439C10.4211 12.6117 10.3661 12.6667 10.2982 12.6667H3.45614C3.38832 12.6667 3.33333 12.6117 3.33333 12.5439V5.70175Z\" fill=\"currentColor\"></path> </svg>\n                    Copy\n                </button>\n            </div>\n            <div 
class=\"code-block-content\">\n                <pre class=\"language-python\"><code><span class=\"line-number\">1</span><span class=\"token keyword\">import</span> json\n<span class=\"line-number\">2</span><span class=\"token keyword\">import</span> os\n<span class=\"line-number\">3</span><span class=\"token keyword\">import</span> tarfile\n<span class=\"line-number\">4</span><span class=\"token keyword\">import</span> daft\n<span class=\"line-number\">5</span><span class=\"token keyword\">from</span> daft <span class=\"token keyword\">import</span> col\n<span class=\"line-number\">6</span>\n<span class=\"line-number\">7</span><span class=\"token keyword\">from</span> daft<span class=\"token punctuation\">.</span>aihc<span class=\"token punctuation\">.</span>common<span class=\"token punctuation\">.</span>udf <span class=\"token keyword\">import</span> aihc_udf\n<span class=\"line-number\">8</span><span class=\"token keyword\">from</span> daft<span class=\"token punctuation\">.</span>aihc<span class=\"token punctuation\">.</span>functions<span class=\"token punctuation\">.</span>process<span class=\"token punctuation\">.</span>tar_extractor_udf <span class=\"token keyword\">import</span> TarUncompress\n<span class=\"line-number\">9</span><span class=\"token keyword\">from</span> daft<span class=\"token punctuation\">.</span>aihc<span class=\"token punctuation\">.</span>functions<span class=\"token punctuation\">.</span>embodied<span class=\"token punctuation\">.</span>convert_dataset_v21_to_v30_udf <span class=\"token keyword\">import</span> ConvertDatasetV21ToV30\n<span class=\"line-number\">10</span>\n<span class=\"line-number\">11</span><span class=\"token comment\"># ====================== Reuses all code from data_convert.py ======================</span>\n<span class=\"line-number\">12</span>TAR_EXTENSIONS <span class=\"token operator\">=</span> <span class=\"token punctuation\">(</span><span class=\"token string\">\".tar\"</span><span class=\"token 
punctuation\">,</span> <span class=\"token string\">\".tar.gz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tgz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tar.bz2\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tbz2\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tar.xz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".txz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tar.zst\"</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">13</span>\n<span class=\"line-number\">14</span><span class=\"token keyword\">def</span> <span class=\"token function\">is_tar_file</span><span class=\"token punctuation\">(</span>filepath<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">bool</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">15</span>    lower <span class=\"token operator\">=</span> filepath<span class=\"token punctuation\">.</span>lower<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">16</span>    <span class=\"token keyword\">if</span> <span class=\"token keyword\">not</span> <span class=\"token builtin\">any</span><span class=\"token punctuation\">(</span>lower<span class=\"token punctuation\">.</span>endswith<span class=\"token punctuation\">(</span>ext<span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> ext <span class=\"token keyword\">in</span> TAR_EXTENSIONS<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">17</span>        <span class=\"token 
keyword\">return</span> <span class=\"token boolean\">False</span>\n<span class=\"line-number\">18</span>    <span class=\"token keyword\">return</span> tarfile<span class=\"token punctuation\">.</span>is_tarfile<span class=\"token punctuation\">(</span>filepath<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">19</span>\n<span class=\"line-number\">20</span><span class=\"token keyword\">def</span> <span class=\"token function\">find_tar_files</span><span class=\"token punctuation\">(</span>directory<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">list</span><span class=\"token punctuation\">[</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">21</span>    tar_files <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n<span class=\"line-number\">22</span>    <span class=\"token keyword\">for</span> root<span class=\"token punctuation\">,</span> _dirs<span class=\"token punctuation\">,</span> files <span class=\"token keyword\">in</span> os<span class=\"token punctuation\">.</span>walk<span class=\"token punctuation\">(</span>directory<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">23</span>        <span class=\"token keyword\">for</span> f <span class=\"token keyword\">in</span> files<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">24</span>            full <span class=\"token operator\">=</span> os<span class=\"token punctuation\">.</span>path<span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span>root<span class=\"token punctuation\">,</span> f<span class=\"token 
punctuation\">)</span>\n<span class=\"line-number\">25</span>            <span class=\"token keyword\">if</span> is_tar_file<span class=\"token punctuation\">(</span>full<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">26</span>                tar_files<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>full<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">27</span>    <span class=\"token keyword\">return</span> tar_files\n<span class=\"line-number\">28</span>\n<span class=\"line-number\">29</span><span class=\"token keyword\">def</span> <span class=\"token function\">safe_members</span><span class=\"token punctuation\">(</span>tf<span class=\"token punctuation\">:</span> tarfile<span class=\"token punctuation\">.</span>TarFile<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">list</span><span class=\"token punctuation\">[</span>tarfile<span class=\"token punctuation\">.</span>TarInfo<span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">30</span>    <span class=\"token keyword\">return</span> <span class=\"token punctuation\">[</span>m <span class=\"token keyword\">for</span> m <span class=\"token keyword\">in</span> tf<span class=\"token punctuation\">.</span>getmembers<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">if</span> <span class=\"token keyword\">not</span> m<span class=\"token punctuation\">.</span>name<span class=\"token punctuation\">.</span>startswith<span class=\"token punctuation\">(</span><span class=\"token string\">\"/\"</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">and</span> <span class=\"token string\">\"..\"</span> <span class=\"token keyword\">not</span> <span 
class=\"token keyword\">in</span> m<span class=\"token punctuation\">.</span>name<span class=\"token punctuation\">]</span>\n<span class=\"line-number\">31</span>\n<span class=\"line-number\">32</span><span class=\"token keyword\">def</span> <span class=\"token function\">extract_recursive</span><span class=\"token punctuation\">(</span>tar_path<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">,</span> output_dir<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">list</span><span class=\"token punctuation\">[</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">33</span>    all_extracted <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n<span class=\"line-number\">34</span>    <span class=\"token keyword\">with</span> tarfile<span class=\"token punctuation\">.</span><span class=\"token builtin\">open</span><span class=\"token punctuation\">(</span>tar_path<span class=\"token punctuation\">)</span> <span class=\"token keyword\">as</span> tf<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">35</span>        members <span class=\"token operator\">=</span> safe_members<span class=\"token punctuation\">(</span>tf<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">36</span>        tf<span class=\"token punctuation\">.</span>extractall<span class=\"token punctuation\">(</span>path<span class=\"token operator\">=</span>output_dir<span class=\"token punctuation\">,</span> members<span class=\"token operator\">=</span>members<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">37</span>      
  all_extracted<span class=\"token punctuation\">.</span>extend<span class=\"token punctuation\">(</span><span class=\"token punctuation\">[</span>os<span class=\"token punctuation\">.</span>path<span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span>output_dir<span class=\"token punctuation\">,</span> m<span class=\"token punctuation\">.</span>name<span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> m <span class=\"token keyword\">in</span> members<span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">38</span>\n<span class=\"line-number\">39</span>    pending <span class=\"token operator\">=</span> find_tar_files<span class=\"token punctuation\">(</span>output_dir<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">40</span>    processed <span class=\"token operator\">=</span> <span class=\"token builtin\">set</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">41</span>\n<span class=\"line-number\">42</span>    <span class=\"token keyword\">while</span> pending<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">43</span>        current <span class=\"token operator\">=</span> pending<span class=\"token punctuation\">.</span>pop<span class=\"token punctuation\">(</span><span class=\"token number\">0</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">44</span>        <span class=\"token keyword\">if</span> current <span class=\"token keyword\">in</span> processed<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">45</span>            <span class=\"token keyword\">continue</span>\n<span class=\"line-number\">46</span>        processed<span class=\"token punctuation\">.</span>add<span class=\"token punctuation\">(</span>current<span class=\"token punctuation\">)</span>\n<span 
class=\"line-number\">47</span>\n<span class=\"line-number\">48</span>        extract_dir <span class=\"token operator\">=</span> os<span class=\"token punctuation\">.</span>path<span class=\"token punctuation\">.</span>dirname<span class=\"token punctuation\">(</span>current<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">49</span>        <span class=\"token keyword\">with</span> tarfile<span class=\"token punctuation\">.</span><span class=\"token builtin\">open</span><span class=\"token punctuation\">(</span>current<span class=\"token punctuation\">)</span> <span class=\"token keyword\">as</span> tf<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">50</span>            members <span class=\"token operator\">=</span> safe_members<span class=\"token punctuation\">(</span>tf<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">51</span>            tf<span class=\"token punctuation\">.</span>extractall<span class=\"token punctuation\">(</span>path<span class=\"token operator\">=</span>extract_dir<span class=\"token punctuation\">,</span> members<span class=\"token operator\">=</span>members<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">52</span>            all_extracted<span class=\"token punctuation\">.</span>extend<span class=\"token punctuation\">(</span><span class=\"token punctuation\">[</span>os<span class=\"token punctuation\">.</span>path<span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span>extract_dir<span class=\"token punctuation\">,</span> m<span class=\"token punctuation\">.</span>name<span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> m <span class=\"token keyword\">in</span> members<span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">53</span>\n<span class=\"line-number\">54</span>        new_tars <span class=\"token operator\">=</span> 
find_tar_files<span class=\"token punctuation\">(</span>output_dir<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">55</span>        <span class=\"token keyword\">for</span> t <span class=\"token keyword\">in</span> new_tars<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">56</span>            <span class=\"token keyword\">if</span> t <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> processed <span class=\"token keyword\">and</span> t <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> pending<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">57</span>                pending<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>t<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">58</span>\n<span class=\"line-number\">59</span>    <span class=\"token keyword\">return</span> all_extracted\n<span class=\"line-number\">60</span>\n<span class=\"line-number\">61</span><span class=\"token keyword\">class</span> <span class=\"token class-name\">RecursiveTarUncompress</span><span class=\"token punctuation\">(</span>TarUncompress<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">62</span>    <span class=\"token triple-quoted-string string\">\"\"\"UDF that recursively extracts nested tar archives.\"\"\"</span>\n<span class=\"line-number\">63</span>    <span class=\"token keyword\">def</span> <span class=\"token function\">__call__</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> input_path<span class=\"token punctuation\">,</span> output_path<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">64</span>        input_list <span class=\"token operator\">=</span> input_path<span class=\"token punctuation\">.</span>to_pylist<span class=\"token 
punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">65</span>        output_list <span class=\"token operator\">=</span> output_path<span class=\"token punctuation\">.</span>to_pylist<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">66</span>        results <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n<span class=\"line-number\">67</span>        <span class=\"token keyword\">for</span> inp<span class=\"token punctuation\">,</span> outp <span class=\"token keyword\">in</span> <span class=\"token builtin\">zip</span><span class=\"token punctuation\">(</span>input_list<span class=\"token punctuation\">,</span> output_list<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">68</span>            os<span class=\"token punctuation\">.</span>makedirs<span class=\"token punctuation\">(</span>outp<span class=\"token punctuation\">,</span> exist_ok<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">69</span>            all_files <span class=\"token operator\">=</span> extract_recursive<span class=\"token punctuation\">(</span>inp<span class=\"token punctuation\">,</span> outp<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">70</span>            results<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>json<span class=\"token punctuation\">.</span>dumps<span class=\"token punctuation\">(</span><span class=\"token punctuation\">{</span>\n<span class=\"line-number\">71</span>                <span class=\"token string\">\"status\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"success\"</span><span class=\"token punctuation\">,</span>\n<span 
class=\"line-number\">72</span>                <span class=\"token string\">\"input\"</span><span class=\"token punctuation\">:</span> inp<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">73</span>                <span class=\"token string\">\"output\"</span><span class=\"token punctuation\">:</span> outp<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">74</span>                <span class=\"token string\">\"extracted_files\"</span><span class=\"token punctuation\">:</span> all_files<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">75</span>                <span class=\"token string\">\"extracted_count\"</span><span class=\"token punctuation\">:</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>all_files<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">76</span>            <span class=\"token punctuation\">}</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">77</span>        <span class=\"token keyword\">return</span> results\n<span class=\"line-number\">78</span>\n<span class=\"line-number\">79</span><span class=\"token comment\"># ====================== Main pipeline workflow ======================</span>\n<span class=\"line-number\">80</span><span class=\"token keyword\">if</span> __name__ <span class=\"token operator\">==</span> <span class=\"token string\">\"__main__\"</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">81</span>    <span class=\"token comment\"># Shared environment initialization (use the Ray runner when DAFT_RUNNER=ray)</span>\n<span class=\"line-number\">82</span>    <span class=\"token keyword\">if</span> os<span class=\"token punctuation\">.</span>getenv<span class=\"token punctuation\">(</span><span class=\"token string\">\"DAFT_RUNNER\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"native\"</span><span class=\"token 
punctuation\">)</span> <span class=\"token operator\">==</span> <span class=\"token string\">\"ray\"</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">83</span>        <span class=\"token keyword\">import</span> ray\n<span class=\"line-number\">84</span>        ray<span class=\"token punctuation\">.</span>init<span class=\"token punctuation\">(</span>dashboard_host<span class=\"token operator\">=</span><span class=\"token string\">\"0.0.0.0\"</span><span class=\"token punctuation\">,</span> ignore_reinit_error<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">85</span>        daft<span class=\"token punctuation\">.</span>set_runner_ray<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">86</span>    daft<span class=\"token punctuation\">.</span>set_execution_config<span class=\"token punctuation\">(</span>actor_udf_ready_timeout<span class=\"token operator\">=</span><span class=\"token number\">6000</span><span class=\"token punctuation\">,</span> min_cpu_per_task<span class=\"token operator\">=</span><span class=\"token number\">0</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">87</span>\n<span class=\"line-number\">88</span>    base_path <span class=\"token operator\">=</span> <span class=\"token string\">\"/mnt/pfs/xx\"</span>\n<span class=\"line-number\">89</span>    convert_output_root <span class=\"token operator\">=</span> <span class=\"token string\">\"/mnt/pfs/xx/lerobotv3\"</span>\n<span class=\"line-number\">90</span>\n<span class=\"line-number\">91</span>    <span class=\"token comment\"># ====================== Step 1: extraction (formerly data_convert.py) ======================</span>\n<span class=\"line-number\">92</span>    <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span><span class=\"token string\">\"=== Step 1: recursively extracting 
tar files ===\"</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">93</span>    tar_files <span class=\"token operator\">=</span> find_tar_files<span class=\"token punctuation\">(</span>base_path<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">94</span>    <span class=\"token keyword\">if</span> <span class=\"token keyword\">not</span> tar_files<span class=\"token punctuation\">:</span>\n<span class=\"line-number\">95</span>        <span class=\"token keyword\">raise</span> ValueError<span class=\"token punctuation\">(</span><span class=\"token string-interpolation\"><span class=\"token string\">f\"No tar files found under </span><span class=\"token interpolation\"><span class=\"token punctuation\">{</span>base_path<span class=\"token punctuation\">}</span></span><span class=\"token string\">\"</span></span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">96</span>\n<span class=\"line-number\">97</span>    <span class=\"token keyword\">def</span> <span class=\"token function\">strip_tar_ext</span><span class=\"token punctuation\">(</span>path<span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span><span class=\"token operator\">></span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">98</span>        <span class=\"token keyword\">while</span> <span class=\"token boolean\">True</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">99</span>            base<span class=\"token punctuation\">,</span> ext <span class=\"token operator\">=</span> os<span class=\"token punctuation\">.</span>path<span class=\"token punctuation\">.</span>splitext<span class=\"token punctuation\">(</span>path<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">100</span>            <span class=\"token keyword\">if</span> 
ext<span class=\"token punctuation\">.</span>lower<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">in</span> <span class=\"token punctuation\">(</span><span class=\"token string\">\".gz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".bz2\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".xz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".zst\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tar\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tgz\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".tbz2\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\".txz\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">101</span>                path <span class=\"token operator\">=</span> base\n<span class=\"line-number\">102</span>            <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n<span class=\"line-number\">103</span>                <span class=\"token keyword\">break</span>\n<span class=\"line-number\">104</span>        <span class=\"token keyword\">return</span> path\n<span class=\"line-number\">105</span>\n<span class=\"line-number\">106</span>    tasks_extract <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>\n<span class=\"line-number\">107</span>        <span class=\"token string\">\"input_path\"</span><span class=\"token punctuation\">:</span> tar_files<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">108</span>        <span class=\"token string\">\"output_path\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">[</span>strip_tar_ext<span class=\"token 
punctuation\">(</span>t<span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> t <span class=\"token keyword\">in</span> tar_files<span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">109</span>    <span class=\"token punctuation\">}</span>\n<span class=\"line-number\">110</span>    num_tasks <span class=\"token operator\">=</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>tasks_extract<span class=\"token punctuation\">[</span><span class=\"token string\">\"input_path\"</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">111</span>    concurrency <span class=\"token operator\">=</span> <span class=\"token builtin\">max</span><span class=\"token punctuation\">(</span>num_tasks<span class=\"token punctuation\">,</span> <span class=\"token number\">1</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">112</span>\n<span class=\"line-number\">113</span>    ds <span class=\"token operator\">=</span> daft<span class=\"token punctuation\">.</span>from_pydict<span class=\"token punctuation\">(</span>tasks_extract<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">114</span>    ds <span class=\"token operator\">=</span> ds<span class=\"token punctuation\">.</span>into_partitions<span class=\"token punctuation\">(</span>num_tasks<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">115</span>\n<span class=\"line-number\">116</span>    ds <span class=\"token operator\">=</span> ds<span class=\"token punctuation\">.</span>with_column<span class=\"token punctuation\">(</span>\n<span class=\"line-number\">117</span>        <span class=\"token string\">\"result\"</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">118</span>        aihc_udf<span class=\"token punctuation\">(</span>\n<span 
class=\"line-number\">119</span>            RecursiveTarUncompress<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">120</span>            construct_args<span class=\"token operator\">=</span><span class=\"token punctuation\">{</span><span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">121</span>            num_cpus<span class=\"token operator\">=</span><span class=\"token number\">1</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">122</span>            num_gpus<span class=\"token operator\">=</span><span class=\"token number\">0</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">123</span>            batch_size<span class=\"token operator\">=</span><span class=\"token number\">1</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">124</span>            concurrency<span class=\"token operator\">=</span>concurrency<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">125</span>            use_process<span class=\"token operator\">=</span><span class=\"token boolean\">True</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">126</span>        <span class=\"token punctuation\">)</span><span class=\"token punctuation\">(</span>col<span class=\"token punctuation\">(</span><span class=\"token string\">\"input_path\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> col<span class=\"token punctuation\">(</span><span class=\"token string\">\"output_path\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">127</span>    <span class=\"token punctuation\">)</span>\n<span class=\"line-number\">128</span>    df_extract <span class=\"token operator\">=</span> ds<span class=\"token punctuation\">.</span>collect<span 
class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">129</span>    <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span><span class=\"token string\">\"=== Extraction complete ===\"</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">130</span>\n<span class=\"line-number\">131</span>    <span class=\"token comment\"># ====================== Step 2: v21 → v30 conversion (formerly lerobotv21-30.py) ======================</span>\n<span class=\"line-number\">132</span>    <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span><span class=\"token string\">\"=== Step 2: starting format conversion ===\"</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">133</span>    tasks_convert <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>\n<span class=\"line-number\">134</span>        <span class=\"token string\">\"input_repoid\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">[</span>\n<span class=\"line-number\">135</span>            <span class=\"token string\">\"lerobot/aloha_sim_insertion_human/\"</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">136</span>            <span class=\"token string\">\"lerobot/pusht/\"</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">137</span>            <span class=\"token string\">\"lerobot/pusht2/\"</span>\n<span class=\"line-number\">138</span>        <span class=\"token punctuation\">]</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">139</span>        <span class=\"token string\">\"input_path\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">[</span><span class=\"token string\">\"/mnt/pfs/xx/test_dataset/dataset/\"</span><span class=\"token punctuation\">]</span> <span class=\"token operator\">*</span> <span class=\"token number\">3</span><span 
class=\"token punctuation\">,</span>\n<span class=\"line-number\">140</span>        <span class=\"token string\">\"output_path\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">[</span>convert_output_root<span class=\"token punctuation\">]</span> <span class=\"token operator\">*</span> <span class=\"token number\">3</span>\n<span class=\"line-number\">141</span>    <span class=\"token punctuation\">}</span>\n<span class=\"line-number\">142</span>    num_datasets <span class=\"token operator\">=</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>tasks_convert<span class=\"token punctuation\">[</span><span class=\"token string\">\"input_repoid\"</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">143</span>\n<span class=\"line-number\">144</span>    ds_convert <span class=\"token operator\">=</span> daft<span class=\"token punctuation\">.</span>from_pydict<span class=\"token punctuation\">(</span>tasks_convert<span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>into_partitions<span class=\"token punctuation\">(</span>num_datasets<span class=\"token punctuation\">)</span>\n<span class=\"line-number\">145</span>\n<span class=\"line-number\">146</span>    ds_convert <span class=\"token operator\">=</span> ds_convert<span class=\"token punctuation\">.</span>with_column<span class=\"token punctuation\">(</span>\n<span class=\"line-number\">147</span>        <span class=\"token string\">\"convert_result\"</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">148</span>        aihc_udf<span class=\"token punctuation\">(</span>\n<span class=\"line-number\">149</span>            ConvertDatasetV21ToV30<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">150</span>            construct_args<span class=\"token operator\">=</span><span class=\"token 
punctuation\">{</span><span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">151</span>            num_cpus<span class=\"token operator\">=</span><span class=\"token number\">0.1</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">152</span>            batch_size<span class=\"token operator\">=</span><span class=\"token number\">1</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">153</span>            concurrency<span class=\"token operator\">=</span>num_datasets<span class=\"token punctuation\">,</span>\n<span class=\"line-number\">154</span>            use_process<span class=\"token operator\">=</span><span class=\"token boolean\">True</span>\n<span class=\"line-number\">155</span>        <span class=\"token punctuation\">)</span><span class=\"token punctuation\">(</span>col<span class=\"token punctuation\">(</span><span class=\"token string\">\"input_repoid\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> col<span class=\"token punctuation\">(</span><span class=\"token string\">\"input_path\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> col<span class=\"token punctuation\">(</span><span class=\"token string\">\"output_path\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n<span class=\"line-number\">156</span>    <span class=\"token punctuation\">)</span>\n<span class=\"line-number\">157</span>    ds_convert<span class=\"token punctuation\">.</span>show<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"line-number\">158</span>    <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span><span class=\"token string\">\"=== Pipeline complete ===\"</span><span class=\"token 
punctuation\">)</span></code></pre>\n            </div>\n        </div>\n    </div>\n  \n<h3 id=\"分布式数据处理\"><a href=\"#%E5%88%86%E5%B8%83%E5%BC%8F%E6%95%B0%E6%8D%AE%E5%A4%84%E7%90%86\" aria-label=\"分布式数据处理 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Distributed data processing</h3>\n<p>In the distributed training module, run the operator code developed above on the Ray compute engine to process the data in a distributed fashion.</p>\n<blockquote>\n<p>You can also process the data on a single machine, directly from a development machine.</p>\n</blockquote>\n<p>To submit a Ray job, see <a href=\"https://cloud.baidu.com/doc/AIHC/s/kmnpx4yk6\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Quickly submit a Ray job</a>. The key parameters are:</p>\n<ol>\n<li>Image: use a Baige (百舸) preset image and select the aihc-daft preset image.</li>\n<li>Command: <code>DAFT_RUNNER=ray python /mnt/pfs/xx/pipeline.py</code>, where <code>pipeline.py</code> is the operator code developed above.</li>\n<li>Compute framework: select Ray.</li>\n<li>Compute resources: you can configure multiple worker instances to run in parallel. During execution, <strong>Daft + Ray</strong> schedules tasks and balances load automatically, aiming to keep the cluster's resources fully utilized.</li>\n<li>Storage mount: mount the storage instance that holds the source data into the container.</li>\n</ol>\n<p>Submit the job to start processing. You can track progress in the logs of the submitter node.</p>\n<p><img src=\"https://bce.bdstatic.com/doc/bce-doc/AIHC/image-20260427230356334_56aea73.png\" alt=\"image-20260427230356334.png\"></p>","fields":{"slug":"smohxoure","title":"AIHC-Daft算子开发使用指南","date":"2026-04-28","extractedHeadings":[]},"headings":[{"value":"Daft 核心特性","depth":2},{"value":"集成aihc-daft方式","depth":2},{"value":"aihc-daft内置算子示例","depth":2},{"value":"aihc-daft 基础参数说明","depth":2},{"value":"aihc_udf 
参数说明","depth":3},{"value":"数据读写方式","depth":3},{"value":"最佳实践","depth":2},{"value":"环境准备","depth":3},{"value":"算子开发","depth":3},{"value":"分布式数据处理","depth":3}]}},"pageContext":{"isCreatedByStatefulCreatePages":false,"slug":"smohxoure","prev":{"id":"3moa0aug1","name":"算子列表","path":"3moa0aug1","filePath":"操作指南/AI数据处理/算子列表/其他/Tar文件解压.md","seo":null,"parentIds":["ilib2qygp","Ymo88m8hi","Imob3m6so","emob46zqa"],"parents":[{"id":"ilib2qygp","documentId":"bfa43a8b-968a-41a1-8c9d-906507eeaed9","name":"操作指南","repoName":"AIHC","filePath":"操作指南","disabled":false,"path":"ilib2qygp","lastMergeTime":null,"isApiDoc":null,"httpMethod":null,"seo":null,"sourceOrgName":null,"sourceRepoName":null,"sourceDocumentId":null},{"id":"Ymo88m8hi","documentId":"c8cb5e38-f8c5-40f4-a424-b0c7895f0c0a","name":"AI数据处理","repoName":"AIHC","filePath":"操作指南/AI数据处理","disabled":false,"path":"Ymo88m8hi","lastMergeTime":"2026-04-21 14:23:10","isApiDoc":null,"httpMethod":null,"seo":null,"sourceOrgName":null,"sourceRepoName":null,"sourceDocumentId":""},{"id":"Imob3m6so","documentId":"fe548e34-6659-4ff5-86f6-eee2c43aec90","name":"算子列表","repoName":"AIHC","filePath":"操作指南/AI数据处理/算子列表","disabled":false,"path":"Imob3m6so","lastMergeTime":null,"isApiDoc":null,"httpMethod":null,"seo":null,"sourceOrgName":null,"sourceRepoName":null,"sourceDocumentId":""},{"id":"emob46zqa","documentId":"75bfe00e-1f9b-4051-80c6-98c160623660","name":"其他","repoName":"AIHC","filePath":"操作指南/AI数据处理/算子列表/其他","disabled":false,"path":"emob46zqa","lastMergeTime":null,"isApiDoc":null,"httpMethod":null,"seo":null,"sourceOrgName":null,"sourceRepoName":null,"sourceDocumentId":""}]},"next":{"id":"lm56m8w1i","name":"数据集管理","path":"lm56m8w1i","filePath":"操作指南/数据集管理/创建数据集.md","seo":{"title":"","keywords":"","description":"","serviceType":null},"parentIds":["ilib2qygp","qm5mbqn8c"],"parents":[{"id":"ilib2qygp","documentId":"bfa43a8b-968a-41a1-8c9d-906507eeaed9","name":"操作指南","repoName":"AIHC","filePath":"操作指南","disabled":false,"path":"ilib2qygp","lastMer
geTime":null,"isApiDoc":null,"httpMethod":null,"seo":null,"sourceOrgName":null,"sourceRepoName":null,"sourceDocumentId":null},{"id":"qm5mbqn8c","documentId":"9a62072c-8cce-4abb-9f9d-f3cece57360c","name":"数据集管理","repoName":"AIHC","filePath":"操作指南/数据集管理","disabled":false,"path":"qm5mbqn8c","lastMergeTime":null,"isApiDoc":null,"httpMethod":null,"seo":{"title":null,"keywords":null,"description":null,"serviceType":null},"sourceOrgName":null,"sourceRepoName":null,"sourceDocumentId":null}]},"parents":[{"id":"ilib2qygp","documentId":"bfa43a8b-968a-41a1-8c9d-906507eeaed9","name":"操作指南","repoName":"AIHC","filePath":"操作指南","disabled":false,"path":"ilib2qygp","lastMergeTime":null,"isApiDoc":null,"httpMethod":null,"seo":null,"sourceOrgName":null,"sourceRepoName":null,"sourceDocumentId":null},{"id":"Ymo88m8hi","documentId":"c8cb5e38-f8c5-40f4-a424-b0c7895f0c0a","name":"AI数据处理","repoName":"AIHC","filePath":"操作指南/AI数据处理","disabled":false,"path":"Ymo88m8hi","lastMergeTime":"2026-04-21 14:23:10","isApiDoc":null,"httpMethod":null,"seo":null,"sourceOrgName":null,"sourceRepoName":null,"sourceDocumentId":""}],"specificSeo":null}}}