Representation engineering using control vectors
Complex system prompts are often used to safeguard LLMs
But they can also be subverted 👇
I recently learned about "Representation Engineering" using control vectors. These control vectors can be applied to a model at inference time to influence how it responds to requests.
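To make the idea concrete, here is a minimal toy sketch of what "applying a control vector at inference" can look like: the same vector is added to the hidden state of every token. This is my own illustration under simplifying assumptions, not the implementation from any particular library. The function name `apply_control_vector`, the `strength` parameter, and the random toy data are all hypothetical, and a real model would apply this inside its layers rather than on a standalone array.

```python
import numpy as np

def apply_control_vector(hidden_states, control_vector, strength=1.0):
    """Add the same steering vector to the hidden state of every token.

    hidden_states: array of shape (num_tokens, hidden_dim)
    control_vector: array of shape (hidden_dim,)
    strength: hypothetical scaling knob; negative values steer the opposite way
    """
    # Broadcasting adds the vector to each row, i.e. to every token position.
    return hidden_states + strength * control_vector

# Toy stand-ins for real model activations (assumed shapes, random values).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5, 8))  # 5 tokens, hidden size 8
control_vector = rng.normal(size=8)      # hypothetical steering direction

steered = apply_control_vector(hidden_states, control_vector, strength=2.0)

# Every token is shifted by exactly the same amount.
shift = steered - hidden_states
print(np.allclose(shift, 2.0 * control_vector))
```

Because the shift is identical at every position, extra tokens added to a prompt do not dilute it, which is the intuition behind using it as a safeguard.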
In her post, Theia Vogel explains how these control vectors could protect against jailbreaking techniques:
"The whole point of a jailbreak is that you're adding more tokens to distract from, invert the effects of, or minimize the troublesome prompt. But a control vector is everywhere, on every token, always."
This technique could result in a less subvertible agent. 👏👏👏
I highly recommend reading Theia's post.