Scattered Notes: The Cursed Way to Quantize RWKV (llama.cpp only)

Environment setup, meow

Yes, this post is filler, unapologetically so (x

I don't feel like explaining at length either; dumping the code is more effective (

#!/usr/bin/sh

llama_cpp_version="b4519"

user="mzwing"

# Create necessary folders
mkdir -p /home/$user/AI/repo/
mkdir -p /home/$user/AI/runner/
mkdir -p /home/$user/AI/model/

# Clone the llama.cpp repo and set up a Python env for its conversion scripts
cd /home/$user/AI/repo/
git clone https://github.com/ggerganov/llama.cpp.git --depth 1
rye init llama_cpp
cd ./llama_cpp/
rye add numpy sentencepiece transformers gguf protobuf torch

# Download prebuilt llama.cpp binaries and the imatrix calibration dataset
cd /home/$user/AI/runner/
mkdir -p ./llama.cpp/
cd ./llama.cpp/
aria2c -c -x16 "https://github.com/MZWNET/actions/releases/download/llama_cpp-$llama_cpp_version/llama-$llama_cpp_version-bin-linux-avx2-intel-mkl-x64.zip"
unzip "llama-$llama_cpp_version-bin-linux-avx2-intel-mkl-x64.zip"
rm -rf "llama-$llama_cpp_version-bin-linux-avx2-intel-mkl-x64.zip"
aria2c -c -x16 https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt

# Install Huggingface CLI
cd /home/$user/AI/repo/
rye init huggingface_cli
cd ./huggingface_cli/
rye add huggingface_hub[hf_transfer]

# Install RWKV related environment
cd /home/$user/AI/repo/
rye init rwkv
cd ./rwkv/
rye add torch numpy

# Back to home
cd /home/$user/
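
(Optional, and not something I actually bothered with at the time: a quick smoke test that the downloaded prebuilt binaries run at all.)

# Sanity check: should print the llama.cpp build info
cd /home/$user/AI/runner/llama.cpp/
./llama-cli --version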

What? Why $user? All will be revealed below (((

By now you should have the right expectations for my garbage-mountain code. Don't worry, it gets worse below (x

To explain briefly: the reason I don't run rye init directly inside llama.cpp is that upstream llama.cpp uses poetry, and how could the ever-contrarian mzw possibly follow the official recommendation (x), so I just created a separate directory and kept using my rye! (x

The HF CLI is installed in a venv because rye install huggingface_hub[hf_transfer] does not give you the huggingface-cli command (ugh, stupid rye
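
For reference, this is the route I originally expected to work but which (at least for me) never produced the command:

# Installing huggingface_hub as a global rye tool -- in my case this did NOT expose huggingface-cli
rye install "huggingface_hub[hf_transfer]"
huggingface-cli --help   # command not found here, hence the dedicated venv in the setup script above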

And then… there almost wasn't an "and then"

The next problem was heavyweight… I searched for ages and could not find how to convert RWKV into Huggingface format, or straight from pth to gguf. RWKV officially released a conversion script together with Transformers, yet that script simply refuses to run (seriously, what an amateur operation). So I was stuck; after two or three days of searching I was about ready to give up.

At one point I even considered having Deepseek write one, but Deepseek clearly has no idea what RWKV6's structure looks like and just made things up, so that plan was dropped.

In the end I couldn't take it anymore and opened a Discussion over at btaskel's repo (communicating in my broken English (x)), tried a few other approaches myself, and finally found the optimal solution… Thanks btaskel very much!

The script I used looks roughly like this:

# convert_rwkv6_to_hf.py
# Original code from <https://rwkv.cn/llamacpp#appendix-code>
# Edited by mzwing<mzwing@mzwing.eu.org>
# Convert an RWKV6 .pth checkpoint into the pytorch_model.bin of an HF-format folder
import os
import sys

import torch

if len(sys.argv) != 3:
    print("Convert RWKV6.0 pth (non-huggingface) checkpoint to Huggingface format")
    print("Usage: python convert_rwkv6_to_hf.py SOURCE_MODEL TARGET_MODEL")
    exit()

SOURCE_MODEL = sys.argv[1]
TARGET_MODEL = sys.argv[2]

# Delete any existing target model first
if os.path.exists(TARGET_MODEL):
    os.remove(TARGET_MODEL)

model = torch.load(SOURCE_MODEL, mmap=True, map_location="cpu")

# Rename all the keys, to include "rwkv."
new_model = {}
for key in model.keys():

    # If the keys start with "blocks"
    if key.startswith("blocks."):
        new_key = "rwkv." + key
        # Replace .att. with .attention.
        new_key = new_key.replace(".att.", ".attention.")
        # Replace .ffn. with .feed_forward.
        new_key = new_key.replace(".ffn.", ".feed_forward.")
        # Replace `0.ln0.` with `0.pre_ln.`
        new_key = new_key.replace("0.ln0.", "0.pre_ln.")
    else:
        # No rename needed
        new_key = key

        # Rename `emb.weight` to `rwkv.embeddings.weight`
        if key == "emb.weight":
            new_key = "rwkv.embeddings.weight"

        # Rename the `ln_out.x` to `rwkv.ln_out.x`
        if key.startswith("ln_out."):
            new_key = "rwkv." + key

    print("Renaming key:", key, "--to-->", new_key)
    new_model[new_key] = model[key]

# Save the new model
print("Saving the new model to:", TARGET_MODEL)
torch.save(new_model, TARGET_MODEL)

#!/usr/bin/sh

author="Seikaijyu"
model="RWKV6-7B-v3-porn-chat"
suffix=""
size="7B"

user="mzwing"

# Create necessary folders
mkdir -p /home/$user/AI/model/$model$suffix-original/
mkdir -p /home/$user/AI/model/$model$suffix/

# Download the original model
aria2c -c -x16 "https://huggingface.co/$author/$model/resolve/main/$model$suffix.pth?download=true" -d /home/$user/AI/model/$model$suffix-original/ -o $model$suffix.pth

# Download the RWKV6 HF config files (weights are skipped via GIT_LFS_SKIP_SMUDGE)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/RWKV/v6-Finch-$size-HF /home/$user/AI/model/$model$suffix/
rm -rf /home/$user/AI/model/$model$suffix/*.bin
rm -rf /home/$user/AI/model/$model$suffix/*.safetensors

# Convert the original model to HF format
source /home/$user/AI/repo/rwkv/.venv/bin/activate
python /home/$user/convert_rwkv6_to_hf.py /home/$user/AI/model/$model$suffix-original/$model$suffix.pth /home/$user/AI/model/$model$suffix/pytorch_model.bin

# Clean up
rm -rf /home/$user/AI/model/$model$suffix-original/

author and model select which model to quantize; suffix handles the case where one repo holds several variants, like RWKV6-7B-v3-porn-chat-pro.pth inside Seikaijyu/RWKV6-7B-v3-porn-chat.

size is the parameter count of the model you are converting, e.g. 1B6 (strictly speaking, of the base model, but in practice the deviation is never that large anyway (x
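
As a concrete (made-up) example, quantizing the -pro variant mentioned above would mean settings roughly like these:

# Hypothetical settings for Seikaijyu/RWKV6-7B-v3-porn-chat-pro.pth
author="Seikaijyu"
model="RWKV6-7B-v3-porn-chat"
suffix="-pro"   # picks RWKV6-7B-v3-porn-chat-pro.pth out of the repo
size="7B"       # clones RWKV/v6-Finch-7B-HF for the config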

Tweak the URL yourself if you need to; I'd rather not touch my own garbage-mountain code again (x

This is where the cursed-ness is maxed out… I truly never imagined it could work like this: just clone the original model's config and swap in your own pytorch_model.bin, and the HF-formatting is done…

Also, remember that even for RWKV v6 world-series models you must use the HF config of the ordinary RWKV models, otherwise the llama.cpp conversion below will fail and complain about a missing vocab. (It seems the RWKV v6 llama.cpp PR was implemented without world models in mind, and since world models turned out to work anyway, things were simply left as-is, hence this odd situation…)

At last, convert_hf_to_gguf.py, launch…

# Convert the model into gguf F16 format
mkdir -p /home/$user/AI/model/$model$suffix-GGUF/
source /home/$user/AI/repo/llama_cpp/.venv/bin/activate
cd /home/$user/AI/repo/llama.cpp/
python ./convert_hf_to_gguf.py --outtype f16 --outfile /home/$user/AI/model/$model$suffix-GGUF/$model$suffix.F16.gguf /home/$user/AI/model/$model$suffix/

# Clean up
rm -rf /home/$user/AI/model/$model$suffix/

# Back to home
cd /home/$user/

No need for me to say much here (x); it's the standard llama.cpp conversion flow (((
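
(Optional, and not part of my original flow: a quick smoke test of the F16 gguf before spending hours quantizing it.)

# Run a few tokens through the freshly converted model
cd /home/$user/AI/runner/llama.cpp/
./llama-cli -m "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.F16.gguf" -p "Hello" -n 32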

Quantization, launch

I put together a small script to automate the quantization (the reason being laziness, obviously)

#!/usr/bin/env bash
# Note: bash rather than sh -- the quant list below is a bash array

llama_cpp_version="b4519"
model="RWKV6-7B-v3-porn-chat"
suffix=""
HF_TOKEN="xxx"

user="mzwing"

cd /home/$user/AI/runner/llama.cpp/

# Login
source /home/$user/AI/repo/huggingface_cli/.venv/bin/activate
huggingface-cli login --token $HF_TOKEN

# Upload F16 model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli upload --repo-type model --commit-message "GGUF model commit (made with llama.cpp $llama_cpp_version)" "$model$suffix-GGUF" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.F16.gguf"

# generate imatrix
echo -e "Generating imatrix ...\n"
./llama-imatrix -m "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.F16.gguf" -f ./calibration_datav3.txt -o "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.imatrix"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli upload --repo-type model --commit-message "GGUF model commit (made with llama.cpp $llama_cpp_version)" "$model$suffix-GGUF" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.imatrix"

# quantize
params=( "Q8_0" "Q6_K" "Q5_K_M" "Q5_K_S" "Q5_1" "Q5_0" "Q4_K_M" "Q4_K_S" "Q4_1" "Q4_0" "Q3_K_L" "Q3_K_M" "Q3_K_S" "Q2_K_S" "Q2_K" "IQ4_XS" "IQ4_NL" "IQ3_XS" "IQ3_M" "IQ3_S" "IQ3_XXS" "IQ2_M" "IQ2_S" "IQ2_XS" "IQ2_XXS" "IQ1_M" "IQ1_S" "TQ2_0" "TQ1_0" )
for param in "${params[@]}"; do
    echo -e "Converting to $param ...\n"
    ./llama-quantize --imatrix "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.imatrix" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.F16.gguf" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.$param.gguf" $param $(nproc)
    HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli upload --repo-type model --commit-message "GGUF model commit (made with llama.cpp $llama_cpp_version)" "$model$suffix-GGUF" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.$param.gguf"
    rm -rf "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.$param.gguf"
done

# Clean up
rm -rf /home/$user/AI/model/$model$suffix-GGUF/

# Back to home
cd /home/$user/

Why log in? See the next section (x

The imatrix part cost me ages (consequence of forgetting to RTFM); in fact, not having figured out imatrix is exactly why I quit AI quantization for a while back then. I used to think the calibration_datav3.txt downloaded during environment setup could be fed straight to llama-quantize, and I also believed only I-quants ever used an imatrix (well, they both start with an I, don't they (forced explanation (hard to watch))). This time I got properly schooled: the imatrix is an importance matrix used to calibrate quantization error, and every quantization type other than F16/F32/BF16 can benefit from it (better quality). See the notes on llama.cpp quantization in the Qwen Docs (which, sadly, I only discovered while writing this post).
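
Put differently, the flow is two steps, and the calibration text only ever touches the first one (a minimal sketch, file names shortened):

# 1. llama-imatrix reads the calibration text and writes an importance matrix
./llama-imatrix -m model.F16.gguf -f calibration_datav3.txt -o model.imatrix
# 2. llama-quantize consumes the imatrix, never the raw calibration text
./llama-quantize --imatrix model.imatrix model.F16.gguf model.Q4_K_M.gguf Q4_K_M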

In the code here I follow the conventions of bartowski (a genuine quantization master) and use his calibration_datav3.txt dataset (a true all-rounder) for imatrix generation.

One more remark: personally I think that as long as the dataset used to generate the imatrix is broad and long enough, it will do its job as an importance matrix. A calibration set closer to the model's actual training data is of course better, but the all-rounder set gets you most of the way there (after all, it isn't meant for human eyes; the tokens are simply recorded, so however unhinged the LLM's outputs get, the resulting calibration should come out about the same. Inspired by the official write-up of the 小型大语言模型星球 (tiny LLM planet) challenge from Hackergame 2023. Of course I have zero data to back this up, so corrections from the experts are welcome ())

I also produced T-quants here, but from what I could dig up in the llama.cpp GitHub repo, T-quants are still at an early stage: on the current llama.cpp master branch they give a nice speedup on CPUs with AVX2 and are unremarkable elsewhere, while GPU support is still sitting in an unmerged PR.

What freeloading gets you

As everyone knows, mzwing has always loved freeloading, so this round of quantization ran on a Huggingface Space (via code-server). Unsurprisingly, the unsurprising happened: halfway through, the Space restarted itself… and the quants I had painstakingly finished so far? Gone, just like that.

Because of that, I changed the llama_cpp_quantize.sh above to upload each result as soon as the step finishes, and wrote a resume_quantization.sh on top (fighting garbage code with more garbage code).

(What? Give up freeloading? Not a chance.webp)

My current guess is that if a Huggingface Space uses a custom Dockerfile and keeps the CPU pegged for a long time (I did hand llama-quantize the full $(nproc) threads (guilty)), the Space restarts itself, and any progress kept on non-persistent storage is lost. For now it's workable together with resume_quantization.sh. That script is so garbage-mountain, and so easy to guess, that I won't post it (((
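
(The rough idea, as a sketch rather than the real script, assuming the GGUF repo lives under my own namespace: ask the Hub which quants already exist and only redo the missing ones.)

# resume sketch: skip quant types whose gguf is already on the Hub
repo="$user/$model$suffix-GGUF"
for param in "${params[@]}"; do
    file="$model$suffix.$param.gguf"
    if python -c "import sys; from huggingface_hub import HfApi; sys.exit(0 if HfApi().file_exists('$repo', '$file') else 1)"; then
        echo "$param already uploaded, skipping"
        continue
    fi
    # ...otherwise quantize and upload exactly as in the main loop above...
done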

The results of the experiment:

The customary hole-digging

At last I can fill in the hole I dug back in the environment setup section! (x

The current plan is a project called autoggufy that does automatic quantization plus automatic resumption of quantization progress; however, the AI-written v(-1) (a riff on v0, yes (x)) doesn't do what I want, and I don't have the time to write it myself (sad)

You thought that was the end? (x

Bet you didn't expect it, I can still keep padding!.jpg (x

While searching around during part two I stumbled upon BBuf/RWKV-World-HF-Tokenizer. The plan was that if I couldn't find anything else, I'd settle for its somewhat amateurish little Python script to convert rwkv6 into HF format (see its README.md); fortunately btaskel offered me a comparatively perfect solution (

However, this repo contains one wonderful little script!

# convert_rwkv5_to_6.py
# Original code from <https://github.com/BBuf/RWKV-World-HF-Tokenizer/blob/main/scripts/convert5to6.py>
import sys
import math
import torch
from collections import OrderedDict
import re

if len(sys.argv) != 3:
    print(f"Converts RWKV5.2 pth (non-huggingface) checkpoint to RWKV6.0")
    print("Usage: python convert5to6.py in_file out_file")
    exit()

model_path = sys.argv[1]

print("Loading file...")
state_dict = torch.load(model_path, map_location='cpu')

def convert_state_dict(state_dict):
    n_layer = 0
    n_embd = 0
    dim_att = 0

    state_dict_keys = list(state_dict.keys())
    for name in state_dict_keys:
        weight = state_dict.pop(name)

        # convert time_decay from (self.n_head, self.head_size) to (1,1,args.dim_att)
        if '.att.time_decay' in name:
            weight = weight.view(1,1,weight.size(0)*weight.size(1))
            n_embd = dim_att = weight.size(-1) 
        # convert time_mix_k, v, r, g into time_maa for both TimeMix and FFN
        if '.time_mix_' in name:
            name = name[:-5] + 'maa_' + name[-1:]
            weight = 1.0 - weight

        if name.startswith('blocks.'):
            layer_id_match = re.search(r"blocks\.(\d+)\.att", name)
            if layer_id_match is not None:
                n_layer = max(n_layer, int(layer_id_match.group(1)) + 1)

        state_dict[name] = weight

    # add in new params not in 5.2
    for layer_id in range(n_layer):
        layer_name = f'blocks.{layer_id}.att'

        ratio_0_to_1 = layer_id / (n_layer - 1)  # 0 to 1
        ratio_1_to_almost0 = 1.0 - (layer_id / n_layer)  # 1 to ~0
        ddd = torch.ones(1, 1, n_embd)
        for i in range(n_embd):
            ddd[0, 0, i] = i / n_embd

        state_dict[layer_name + '.time_maa_x'] = (1.0 - torch.pow(ddd, ratio_1_to_almost0))
        state_dict[layer_name + '.time_maa_w'] = (1.0 - torch.pow(ddd, ratio_1_to_almost0))

        TIME_MIX_EXTRA_DIM = 32 # generate TIME_MIX for w,k,v,r,g
        state_dict[layer_name + '.time_maa_w1'] = (torch.zeros(n_embd, TIME_MIX_EXTRA_DIM*5).uniform_(-1e-4, 1e-4))
        state_dict[layer_name + '.time_maa_w2'] = (torch.zeros(5, TIME_MIX_EXTRA_DIM, n_embd).uniform_(-1e-4, 1e-4))

        TIME_DECAY_EXTRA_DIM = 64
        state_dict[layer_name + '.time_decay_w1'] = (torch.zeros(n_embd, TIME_DECAY_EXTRA_DIM).uniform_(-1e-4, 1e-4))
        state_dict[layer_name + '.time_decay_w2'] = (torch.zeros(TIME_DECAY_EXTRA_DIM, dim_att).uniform_(-1e-4, 1e-4))

    print(f"n_layer: {n_layer}\nn_embd: {n_embd}")

    return state_dict

state_dict = convert_state_dict(state_dict)

torch.save(state_dict,sys.argv[2])
print("DONE. File written.")

Paired with a small tweak to the download step of the earlier download-and-convert script (grab the v5 checkpoint, convert it to v6, then let the rest of the pipeline run unchanged):

# Download the original (RWKV v5) model
aria2c -c -x16 "https://huggingface.co/$author/$model/resolve/main/$model$suffix.pth?download=true" -d /home/$user/AI/model/$model$suffix-original/ -o $model$suffix-5.pth

# Convert RWKV5 to RWKV6
source /home/$user/AI/repo/rwkv/.venv/bin/activate
python /home/$user/convert_rwkv5_to_6.py /home/$user/AI/model/$model$suffix-original/$model$suffix-5.pth /home/$user/AI/model/$model$suffix-original/$model$suffix.pth
rm -rf /home/$user/AI/model/$model$suffix-original/$model$suffix-5.pth

Magical… it can actually convert RWKV 5.2 into RWKV 6.0…

So of course I immediately gave it a try, and somehow I actually pulled it off? Results of the experiment

Postscript

Anyway, that's the weird way to quantize RWKV… consider this a memo to myself. Doing this while in my final year of high school means carrying quite a bit of pressure.

Also, I originally planned to have Deepseek write this post for me, and, unsurprisingly, the unsurprising happened (as VSCode's autocomplete would put it (x)): Deepseek's writing style is just too far from mine, so I had to finish the article myself, which I suppose counts as some accountability for a once-a-year blogger (

The code appearing in this post is released under the same open-source license as the article itself (quoted code follows the license of its original repository).

Thanks for reading all the way to here (
