In the case of supervised learning, the trainers played both sides: the user and the AI assistant. In the reinforcement learning stage, human trainers first ranked responses that the model had generated in a previous conversation.[14] These rankings were used to create "reward models" which were us