In this repo, I implement the full red-teaming pipeline with GFlowNets (GFN).

Full Process

First, I use the toxic dataset from [PKU-Alignment/PKU-SafeRLHF] to fine-tune a GPT-2 model, which serves as the red model.

Given a fixed input prompt, the red model generates an attack question.
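A minimal sketch of this generation step is given below. It assumes the fine-tuned red model was saved to a local `./red_model` directory and uses a hypothetical fixed prompt; the actual prompt and decoding settings in this repo may differ.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("./red_model")   # assumed save path
red_model = GPT2LMHeadModel.from_pretrained("./red_model")

fixed_prompt = "Ask a question:"  # hypothetical fixed input
inputs = tokenizer(fixed_prompt, return_tensors="pt")

# Sample an attack question from the red model's policy.
with torch.no_grad():
    output_ids = red_model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=40,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )

attack_question = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```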

The attack question is then fed to the target model, in the hope of eliciting a toxic response.
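The sketch below shows this querying step. The target model name is a placeholder (the repo's actual victim model is not stated here), and the generation hyperparameters are illustrative only.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

target_model_name = "gpt2-large"  # placeholder: substitute the actual target model
target_tokenizer = AutoTokenizer.from_pretrained(target_model_name)
target_model = AutoModelForCausalLM.from_pretrained(target_model_name)

attack_question = "..."  # question sampled from the red model

inputs = target_tokenizer(attack_question, return_tensors="pt")
response_ids = target_model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=128,
    pad_token_id=target_tokenizer.eos_token_id,
)
response = target_tokenizer.decode(
    response_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```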

The reward model, LLaMAGuard-7B, is a two-class (safe/unsafe) classifier: if the response is judged unsafe, the reward is 1; otherwise the reward is 1e-3.
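A rough sketch of how this reward could be computed is shown below, assuming the Hugging Face `meta-llama/LlamaGuard-7b` checkpoint and its chat template; the exact prompting and output parsing used in this repo may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_name = "meta-llama/LlamaGuard-7b"  # assumed checkpoint
guard_tokenizer = AutoTokenizer.from_pretrained(guard_name)
guard_model = AutoModelForCausalLM.from_pretrained(guard_name, torch_dtype=torch.float16)

def compute_reward(attack_question: str, response: str) -> float:
    chat = [
        {"role": "user", "content": attack_question},
        {"role": "assistant", "content": response},
    ]
    input_ids = guard_tokenizer.apply_chat_template(chat, return_tensors="pt")
    with torch.no_grad():
        output_ids = guard_model.generate(input_ids, max_new_tokens=20, do_sample=False)
    verdict = guard_tokenizer.decode(
        output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True
    )
    # LLaMAGuard emits "safe" or "unsafe" (plus the violated categories).
    return 1.0 if "unsafe" in verdict else 1e-3
```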

Finally, we don’t want the red model to drift too far from its original weights (otherwise it may start generating meaningless text), so a frozen reference model is used to compute a KL divergence that is added as a negative reward.
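Below is a minimal sketch of that KL penalty, assuming the reference model is a frozen copy of the pre-GFN red model and that the per-token KL is summed over the generated question; the coefficient `kl_coef` is a hypothetical knob, not a value from this repo.

```python
import torch
import torch.nn.functional as F

def kl_penalty(policy_model, ref_model, input_ids, gen_start: int) -> torch.Tensor:
    """Sum of per-token KL(policy || reference) over the generated tokens."""
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits
    policy_logits = policy_model(input_ids).logits

    # Logits at position t predict token t+1, so restrict to the generated span.
    policy_logp = F.log_softmax(policy_logits[:, gen_start - 1:-1], dim=-1)
    ref_logp = F.log_softmax(ref_logits[:, gen_start - 1:-1], dim=-1)

    kl = (policy_logp.exp() * (policy_logp - ref_logp)).sum(-1)  # per-token KL
    return kl.sum(-1)  # summed over the generated tokens

# kl_coef = 0.1  # hypothetical coefficient
# total_reward = toxicity_reward - kl_coef * kl_penalty(red_model, ref_model, ids, start)
```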

I use Detailed Balance (DB) as the GFlowNet training objective.
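For reference, here is a minimal sketch of a DB loss for autoregressive generation. It assumes the log-flows of the prefix states are predicted by a separate head (the `log_flows` tensor below is hypothetical); since every token sequence has a unique parent prefix, the backward policy is deterministic and log P_B = 0, and the terminal flow is tied to the reward via log F(s_T) = log R(x).

```python
import torch

def db_loss(log_flows: torch.Tensor,   # [T+1] log F for each prefix s_0..s_T
            log_pf: torch.Tensor,      # [T]   log P_F(token_t | prefix s_t)
            log_reward: torch.Tensor   # []    log R of the completed question
            ) -> torch.Tensor:
    # Tie the terminal flow to the reward.
    log_flows = torch.cat([log_flows[:-1], log_reward.view(1)])
    # DB residual per transition:
    #   log F(s_t) + log P_F(s_{t+1} | s_t) - log F(s_{t+1}) - log P_B(s_t | s_{t+1}),
    # with log P_B = 0 because each state has a single parent.
    residual = log_flows[:-1] + log_pf - log_flows[1:]
    return (residual ** 2).mean()
```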

Details of the SFT process for the red model

I use the toxic dataset from [PKU-Alignment/PKU-SafeRLHF] to fine-tune gpt2, as sketched below.
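This is a minimal sketch of that SFT step, assuming causal-LM fine-tuning of gpt2 on the dataset's `prompt` field so that the model learns to produce attack-style questions; the preprocessing, hyperparameters, and save path used in this repo may differ.

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")

def tokenize(batch):
    # Train on the (potentially harmful) prompts only.
    return tokenizer(batch["prompt"], truncation=True, max_length=64)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./red_model",          # assumed save path
        num_train_epochs=1,                # illustrative hyperparameters
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("./red_model")
```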