Fragmented Notes · RWKV's Cult Quantization (llama.cpp only)

Environment Preparation Meow#

I just want to get straight to the code; it's more efficient that way (

#!/usr/bin/sh

llama_cpp_version="b4519"

user="mzwing"

# Create necessary folders
mkdir -p /home/$user/AI/repo/
mkdir -p /home/$user/AI/runner/
mkdir -p /home/$user/AI/model/

# Install llama.cpp repo
cd /home/$user/AI/repo/
git clone https://github.com/ggerganov/llama.cpp.git --depth 1
rye init llama_cpp
cd ./llama_cpp/
rye add numpy sentencepiece transformers gguf protobuf torch

# Install llama.cpp binary
cd /home/$user/AI/runner/
mkdir -p ./llama.cpp/
cd ./llama.cpp/
aria2c -c -x16 "https://github.com/MZWNET/actions/releases/download/llama_cpp-$llama_cpp_version/llama-$llama_cpp_version-bin-linux-avx2-intel-mkl-x64.zip"
unzip "llama-$llama_cpp_version-bin-linux-avx2-intel-mkl-x64.zip"
rm -rf "llama-$llama_cpp_version-bin-linux-avx2-intel-mkl-x64.zip"
aria2c -c -x16 https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt

# Install Huggingface CLI
cd /home/$user/AI/repo/
rye init huggingface_cli
cd ./huggingface_cli/
rye add huggingface_hub[hf_transfer]

# Install RWKV related environment
cd /home/$user/AI/repo/
rye init rwkv
cd ./rwkv/
rye add torch numpy

# Back to home
cd /home/$user/

What? Why use $user? The answer will be revealed later (

By now you should have a rough idea of what my spaghetti-mountain code looks like; don't worry, it only gets worse from here (

To put it simply, the reason for not running rye init directly inside llama.cpp is that upstream llama.cpp uses poetry, but how could a contrarian like mzw possibly follow the official recommendation obediently ( ), so I created a separate directory and kept using my beloved rye! (

The HF CLI gets installed into its own venv (another rye project) because rye install huggingface_hub[hf_transfer] won't expose the huggingface-cli command (annoying, petty rye).
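
So the command ends up inside that project's .venv, and every script below simply activates it before use, for example:

# Activate the rye project's venv to get huggingface-cli on PATH
source /home/$user/AI/repo/huggingface_cli/.venv/bin/activate
huggingface-cli --help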

Then... and Then There Was Almost No "Then"#

The next problem I ran into was a heavyweight one... After searching for a long time, I couldn't find a way to convert RWKV to Huggingface format, nor to convert directly from pth to gguf. RWKV's official team had clearly released a conversion script alongside the Transformers support, but that script simply wouldn't run (I was shocked too; what a shoddy setup). This left me stuck, and after roughly two or three days of searching I was almost ready to give up.

I even considered asking Deepseek to write one, but obviously, Deepseek didn't understand the structure of RWKV6, and just wrote a bunch of nonsense, so I gave up.

In the end I couldn't hold back! I went to the expert btaskel and opened a Discussion (communicating in my broken English ( ), tried a few other approaches along the way, and finally found the best solution... Thanks a lot, btaskel!

The script used looks something like this:

# convert_rwkv6_to_hf.py
# Original code from <https://rwkv.cn/llamacpp#appendix-code>
# Edited by mzwing<mzwing@mzwing.eu.org>
# Convert the model into pytorch_model.bin
import sys
import torch

if len(sys.argv) != 3:
    print(f"Convert RWKV6.0 pth (non-huggingface) checkpoint to Huggingface format")
    print("Usage: python convert_rwkv6_to_hf.py SOURCE_MODEL TARGET_MODEL")
    exit()

SOURCE_MODEL = sys.argv[1]
TARGET_MODEL = sys.argv[2]

# delete target model
import os

if os.path.exists(TARGET_MODEL):
    os.remove(TARGET_MODEL)

model = torch.load(SOURCE_MODEL, mmap=True, map_location="cpu")

# Rename all the keys, to include "rwkv."
new_model = {}
for key in model.keys():

    # If the keys start with "blocks"
    if key.startswith("blocks."):
        new_key = "rwkv." + key
        # Replace .att. with .attention.
        new_key = new_key.replace(".att.", ".attention.")
        # Replace .ffn. with .feed_forward.
        new_key = new_key.replace(".ffn.", ".feed_forward.")
        # Replace `0.ln0.` with `0.pre_ln.`
        new_key = new_key.replace("0.ln0.", "0.pre_ln.")
    else:
        # No rename needed
        new_key = key

        # Rename `emb.weight` to `rwkv.embeddings.weight`
        if key == "emb.weight":
            new_key = "rwkv.embeddings.weight"

        # Rename `ln_out.x` to `rwkv.ln_out.x`
        if key.startswith("ln_out."):
            new_key = "rwkv." + key

    print("Renaming key:", key, "--to-->", new_key)
    new_model[new_key] = model[key]

# Save the new model
print("Saving the new model to:", TARGET_MODEL)
torch.save(new_model, TARGET_MODEL)

#!/usr/bin/sh

author="Seikaijyu"
model="RWKV6-7B-v3-porn-chat"
suffix=""
size="7B"

user="mzwing"

# Create necessary folders
mkdir -p /home/$user/AI/model/$model$suffix-original/
mkdir -p /home/$user/AI/model/$model$suffix/

# Download the original model
aria2c -c -x16 "https://huggingface.co/$author/$model/resolve/main/$model$suffix.pth?download=true" -d /home/$user/AI/model/$model$suffix-original/ -o $model$suffix.pth

# Download RWKV6 config file
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/RWKV/v6-Finch-$size-HF /home/$user/AI/model/$model$suffix/
rm -rf /home/$user/AI/model/$model$suffix/*.bin
rm -rf /home/$user/AI/model/$model$suffix/*.safetensors

# Convert the original model to HF format
source /home/$user/AI/repo/rwkv/.venv/bin/activate
python /home/$user/convert_rwkv6_to_hf.py /home/$user/AI/model/$model$suffix-original/$model$suffix.pth /home/$user/AI/model/$model$suffix/pytorch_model.bin

# Clean up
rm -rf /home/$user/AI/model/$model$suffix-original/

author and model select which model to quantize, while suffix handles cases like Seikaijyu/RWKV6-7B-v3-porn-chat, where multiple variants are stored in a single repository.

size is the parameter count of the model you want to convert, such as 1B6 (strictly speaking it's the size of the base model, but the deviation usually isn't large (
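
For instance, the knobs fit together roughly like this (the author/model names below are made up; only the shape matters):

# Hypothetical example values
author="SomeAuthor"        # HF user/org hosting the original .pth
model="RWKV6-1B6-example"  # repo name, also the base of the file name
suffix="-chat"             # selects $model$suffix.pth inside that repo
size="1B6"                 # clones RWKV/v6-Finch-1B6-HF for the config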

Feel free to modify the links; I don't want to touch them (

This is quite the cult ritual... I really never expected it to work like this: directly clone the official model's config, then swap in pytorch_model.bin, and the HF formatting is done...
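
To make it concrete, here is roughly what the HF-format folder ends up containing (the exact file list depends on the RWKV/v6-Finch-*-HF repo, so treat this as a sketch from memory):

ls /home/$user/AI/model/$model$suffix/
# config.json        <- cloned from the official HF config repo
# tokenizer files    <- cloned from the official HF config repo
# pytorch_model.bin  <- our converted checkpoint, swapped in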

Also, remember that even for RWKV v6 World-series models, you should use the HF config of the regular RWKV model; otherwise the subsequent llama.cpp conversion will fail, complaining that the vocab is missing. (It seems the World models weren't considered when the RWKV v6 llama.cpp PR was implemented, yet they turned out to work anyway, which is how we ended up in this odd situation...)

Finally, convert_hf_to_gguf.py can be started...#

# Convert the model into gguf F16 format
mkdir -p /home/$user/AI/model/$model$suffix-GGUF/
source /home/$user/AI/repo/llama_cpp/.venv/bin/activate
cd /home/$user/AI/repo/llama.cpp/
python ./convert_hf_to_gguf.py --outtype f16 --outfile /home/$user/AI/model/$model$suffix-GGUF/$model$suffix.F16.gguf /home/$user/AI/model/$model$suffix/

# Clean up
rm -rf /home/$user/AI/model/$model$suffix/

# Back to home
cd /home/$user/

I don't need to say much here ( it's just the standard llama.cpp conversion process (

Quantization, Start#

I made a small script to automate quantization (just being lazy)

#!/usr/bin/sh

model="RWKV6-7B-v3-porn-chat"
suffix=""
HF_TOKEN="xxx"

user="mzwing"

cd /home/$user/AI/runner/llama.cpp/

# Login
source /home/$user/AI/repo/huggingface_cli/.venv/bin/activate
huggingface-cli login --token $HF_TOKEN

# Upload F16 model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli upload --repo-type model --commit-message "GGUF model commit (made with llama.cpp $llama_cpp_version)" "$model$suffix-GGUF" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.F16.gguf"

# generate imatrix
echo -e "Generating imatrix ...\n"
./llama-imatrix -m "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.F16.gguf" -f ./calibration_datav3.txt -o "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.imatrix"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli upload --repo-type model --commit-message "GGUF model commit (made with llama.cpp $llama_cpp_version)" "$model$suffix-GGUF" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.imatrix"

# quantize
params=( "Q8_0" "Q6_K" "Q5_K_M" "Q5_K_S" "Q5_1" "Q5_0" "Q4_K_M" "Q4_K_S" "Q4_1" "Q4_0" "Q3_K_L" "Q3_K_M" "Q3_K_S" "Q2_K_S" "Q2_K" "IQ4_XS" "IQ4_NL" "IQ3_XS" "IQ3_M" "IQ3_S" "IQ3_XXS" "IQ2_M" "IQ2_S" "IQ2_XS" "IQ2_XXS" "IQ1_M" "IQ1_S" "TQ2_0" "TQ1_0" )
for param in "${params[@]}"; do
    echo -e "Converting to $param ...\n"
    ./llama-quantize --imatrix "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.imatrix" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.F16.gguf" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.$param.gguf" $param $(nproc)
    HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli upload --repo-type model --commit-message "GGUF model commit (made with llama.cpp $llama_cpp_version)" "$model$suffix-GGUF" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.$param.gguf"
    rm -rf "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.$param.gguf"
done

# Clean up
rm -rf /home/$user/AI/model/$model$suffix-GGUF/

# Back to home
cd /home/$user/

Why do I need to log in? The next section explains (

The imatrix part troubled me for a long time (entirely because I forgot to read the manual). I used to think that the calibration_datav3.txt downloaded in Environment Preparation could be fed directly to llama-quantize, and that only I-Quants used an imatrix (why else would they start with I (forced explanation (hard to watch)). This time I learned my lesson: the imatrix is an importance matrix used to calibrate the quantization, and every quantization type other than F16/F32/BF16 can benefit from it (it improves quality). You can refer to the description of llama.cpp quantization in the Qwen Docs (though I only discovered that great resource while writing this article, sad).
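
In other words (a stripped-down version of the commands in the script above, with placeholder file names): the text dataset is only ever consumed by llama-imatrix, and llama-quantize just takes the resulting .imatrix file.

# Generate the importance matrix from the calibration text
./llama-imatrix -m model.F16.gguf -f ./calibration_datav3.txt -o model.imatrix
# Feed the .imatrix (not the .txt!) into quantization
./llama-quantize --imatrix model.imatrix model.F16.gguf model.Q4_K_M.gguf Q4_K_M $(nproc)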

In this code, I followed the standard set by bartowski (a true AI quantization expert) and used his calibration_datav3.txt dataset (genuinely general-purpose) to generate the imatrix.

Additionally, I believe that as long as the dataset used to generate the imatrix covers enough domains and is long enough, it can do the job of an importance matrix. Of course, a calibration dataset closer to the model's training data is better, but a general-purpose dataset can yield similar results (after all, it isn't meant for human reading; as long as the tokens are recorded, no matter how outrageous the LLM's outputs are, the resulting calibration should be similar. Inspired by the official writeup of the small language model planet challenge in Hackergame 2023). Of course, I have no data to back this up, and I welcome corrections from the experts ( ).

Here I also produced T-Quants, but from what I found in the llama.cpp GitHub repo, T-Quants are still at an early stage: on the master branch of llama.cpp they give a good speedup on CPUs with AVX2, while other CPUs see only modest gains, and GPU support is still sitting in a PR that has not been merged.

The Consequences of Free Riding#

However, as everyone knows, mzwing has always loved free riding, so this quantization was run on a Huggingface Space (via code-server), and as expected, the unexpected happened: the Space restarted itself halfway through the quantization... wiping out all the hard work done up to that point.

To deal with this, I added an automatic upload after each step to the llama_cpp_quantize.sh above, and wrote a resume_quantization.sh (fighting spaghetti with more spaghetti (jk)).

(What? Let me give up free riding? Impossible.webp)

My current guess is that if you use a custom Dockerfile in a Huggingface Space and keep CPU usage high for a long time (I passed $(nproc) straight to llama-quantize (guilty)), the Space will restart itself, and anything saved in non-persistent storage is lost. For now, resume_quantization.sh works around it. Since it's far too much of a spaghetti hack and everyone can guess how to write it, I won't share the real thing, but the gist is sketched below (
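
A rough sketch of the idea (not my actual resume_quantization.sh; the remaining_params list is whatever you can see is still missing from the HF repo, and it assumes the F16 gguf and the imatrix were already uploaded before the restart):

#!/usr/bin/sh

model="RWKV6-7B-v3-porn-chat"
suffix=""
llama_cpp_version="b4519"
HF_TOKEN="xxx"

user="mzwing"
hf_user="mzwing" # namespace the *-GGUF repo lives under

cd /home/$user/AI/runner/llama.cpp/

# Login
source /home/$user/AI/repo/huggingface_cli/.venv/bin/activate
huggingface-cli login --token $HF_TOKEN

# Pull back the already-uploaded F16 gguf and imatrix (local storage was wiped)
mkdir -p /home/$user/AI/model/$model$suffix-GGUF/
aria2c -c -x16 "https://huggingface.co/$hf_user/$model$suffix-GGUF/resolve/main/$model$suffix.F16.gguf" -d /home/$user/AI/model/$model$suffix-GGUF/ -o $model$suffix.F16.gguf
aria2c -c -x16 "https://huggingface.co/$hf_user/$model$suffix-GGUF/resolve/main/$model$suffix.imatrix" -d /home/$user/AI/model/$model$suffix-GGUF/ -o $model$suffix.imatrix

# Only the quant types that never made it to the repo (fill in by hand)
remaining_params=( "IQ2_M" "IQ2_S" "IQ2_XS" "IQ2_XXS" "IQ1_M" "IQ1_S" "TQ2_0" "TQ1_0" )
for param in "${remaining_params[@]}"; do
    echo -e "Converting to $param ...\n"
    ./llama-quantize --imatrix "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.imatrix" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.F16.gguf" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.$param.gguf" $param $(nproc)
    HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli upload --repo-type model --commit-message "GGUF model commit (made with llama.cpp $llama_cpp_version)" "$model$suffix-GGUF" "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.$param.gguf"
    rm -rf "/home/$user/AI/model/$model$suffix-GGUF/$model$suffix.$param.gguf"
done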

Experimental results:

Filling the Usual Pit#

Finally, I can fill the pit I dug in Environment Preparation (

I'm currently planning a project called autoggufy to handle automatic quantization plus automatic recovery of quantization progress, but the AI-written v(-1) (a play on v0 ( ) doesn't meet my expectations, and I don't have that much time to write it myself (sad).

Did you think this was the end? (#

Surprisingly, I can still ramble! .jpg (

While searching during part two, I unexpectedly discovered BBuf/RWKV-World-HF-Tokenizer. If I hadn't found any other method, my original plan was to use its somewhat shoddy py scripts to convert rwkv6 to HF format (see its README.md), but fortunately btaskel provided me with a more or less perfect solution (

However, there is a wonderful script in this repository!

# convert_rwkv5_to_6.py
# Original code from <https://github.com/BBuf/RWKV-World-HF-Tokenizer/blob/main/scripts/convert5to6.py>
import sys
import math
import torch
from collections import OrderedDict
import re

if len(sys.argv) != 3:
    print(f"Converts RWKV5.2 pth (non-huggingface) checkpoint to RWKV6.0")
    print("Usage: python convert5to6.py in_file out_file")
    exit()

model_path = sys.argv[1]

print("Loading file...")
state_dict = torch.load(model_path, map_location='cpu')

def convert_state_dict(state_dict):
    n_layer = 0
    n_embd = 0
    dim_att = 0

    state_dict_keys = list(state_dict.keys())
    for name in state_dict_keys:
        weight = state_dict.pop(name)

        # convert time_decay from (self.n_head, self.head_size) to (1,1,args.dim_att)
        if '.att.time_decay' in name:
            weight = weight.view(1,1,weight.size(0)*weight.size(1))
            n_embd = dim_att = weight.size(-1) 
        # convert time_mix_k, v, r, g into time_maa for both TimeMix and FFN
        if '.time_mix_' in name:
            name = name[:-5] + 'maa_' + name[-1:]
            weight = 1.0 - weight

        if name.startswith('blocks.'):
            layer_id_match = re.search(r"blocks\.(\d+)\.att", name)
            if layer_id_match is not None:
                n_layer = max(n_layer, int(layer_id_match.group(1)) + 1)

        state_dict[name] = weight

    # add in new params not in 5.2
    for layer_id in range(n_layer):
        layer_name = f'blocks.{layer_id}.att'

        ratio_0_to_1 = layer_id / (n_layer - 1)  # 0 to 1
        ratio_1_to_almost0 = 1.0 - (layer_id / n_layer)  # 1 to ~0
        ddd = torch.ones(1, 1, n_embd)
        for i in range(n_embd):
            ddd[0, 0, i] = i / n_embd

        state_dict[layer_name + '.time_maa_x'] = (1.0 - torch.pow(ddd, ratio_1_to_almost0))
        state_dict[layer_name + '.time_maa_w'] = (1.0 - torch.pow(ddd, ratio_1_to_almost0))

        TIME_MIX_EXTRA_DIM = 32 # generate TIME_MIX for w,k,v,r,g
        state_dict[layer_name + '.time_maa_w1'] = (torch.zeros(n_embd, TIME_MIX_EXTRA_DIM*5).uniform_(-1e-4, 1e-4))
        state_dict[layer_name + '.time_maa_w2'] = (torch.zeros(5, TIME_MIX_EXTRA_DIM, n_embd).uniform_(-1e-4, 1e-4))

        TIME_DECAY_EXTRA_DIM = 64
        state_dict[layer_name + '.time_decay_w1'] = (torch.zeros(n_embd, TIME_DECAY_EXTRA_DIM).uniform_(-1e-4, 1e-4))
        state_dict[layer_name + '.time_decay_w2'] = (torch.zeros(TIME_DECAY_EXTRA_DIM, dim_att).uniform_(-1e-4, 1e-4))

    print(f"n_layer: {n_layer}\nn_embd: {n_embd}")

    return state_dict

state_dict = convert_state_dict(state_dict)

torch.save(state_dict,sys.argv[2])
print("DONE. File written.")
With it, the download/convert steps in the earlier script become:

# Download the original model
aria2c -c -x16 "https://huggingface.co/$author/$model/resolve/main/$model$suffix.pth?download=true" -d /home/$user/AI/model/$model$suffix-original/ -o $model$suffix-5.pth

# Convert RWKV5 to RWKV6
source /home/$user/AI/repo/rwkv/.venv/bin/activate
python /home/$user/convert_rwkv5_to_6.py /home/$user/AI/model/$model$suffix-original/$model$suffix-5.pth /home/$user/AI/model/$model$suffix-original/$model$suffix.pth
rm -rf /home/$user/AI/model/$model$suffix-original/$model$suffix-5.pth

Amazing... it can actually convert RWKV 5.2 to RWKV 6.0...

So I decisively started experimenting, and surprisingly, I succeeded? Experimental Results

Epilogue#

In short, this is the bizarre quantization method for RWKV... It's also a reminder for myself. Now that I'm in my senior year, doing all this adds quite a bit of pressure.

Additionally, I initially planned to let Deepseek write this up, but as expected, the unexpected happened (as VSCode's autocomplete would put it ( )). The gap between Deepseek's style and mine was too large, so I had no choice but to finish the article myself, which also counts as an update from this once-a-year blogger (

The code and text appearing in this article follow the same open-source license (the quoted code follows the licenses of its respective repositories).

Thank you for reading this far (
