Training Models to be Virtuous


Axios covered a recent Washington Post report on the contents of the datasets used to train LLMs like ChatGPT. The hook of the story is that the training data contains all kinds of content from across the internet, including plenty of discussion from groups whose behavior most people would find reprehensible. A natural response might be to train these models only on virtuous content.

Putting aside the question of who decides what is virtuous, I'm more curious about the best way to train a model that behaves well. Is it more effective to train it on all ideas, labeling each as desirable or undesirable, or to train it only on a carefully curated set of ideas that are considered desirable?
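
To make the two options concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Example` type, the `desirable` label, and the marker strings are hypothetical and not taken from any real training pipeline. It only shows how the two datasets would differ, not how a model would actually be trained on them.

```python
from dataclasses import dataclass

# Hypothetical marker strings; a real pipeline would choose its own.
GOOD_MARKER = "<|desirable|>"
BAD_MARKER = "<|undesirable|>"


@dataclass(frozen=True)
class Example:
    text: str
    desirable: bool  # assumed to come from a human or classifier judgment


def curate(examples):
    """Option 1: keep only the examples judged desirable."""
    return [ex.text for ex in examples if ex.desirable]


def tag(examples):
    """Option 2: keep every example, prefixed with a marker for its label."""
    return [
        f"{GOOD_MARKER if ex.desirable else BAD_MARKER} {ex.text}"
        for ex in examples
    ]


corpus = [
    Example("Helping a neighbor carry groceries is kind.", True),
    Example("A forum post praising violence.", False),
]

curated = curate(corpus)  # the model never sees the second example
tagged = tag(corpus)      # the model sees both, each with its label
```

In the curated version the undesirable text simply disappears, while in the tagged version it stays in the data with its label attached, which is the distinction the question above turns on.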

It might be important to consider the intended use case. If the model will be used in an environment where inputs are very carefully controlled, it's likely just fine to give the model no awareness of concepts it will never encounter.

But what if the input to the model is unsanitized? How can an LLM recognize and caution against murder if it doesn't know what murder is? Do we want models supplying the 'most probable' next word for a sentence unlike anything they have been exposed to? It seems that before we can call a behavior undesirable, we first have to define it. Which is to say: I think we'd be better off focusing on how to inoculate models against poor behavior rather than trying to hermetically seal them off from anything bad during training.