

Situation:
After the initial Alexa+ early access announcement but before the release, both leadership and anecdotal customer feedback were clear: Alexa+ was not yet meeting the quality bar for conversational excellence. It was described as "dull and robotic," "sycophantic," and "asking too many follow-up questions." Since my team owned CxD for the core Alexa+ experience, we needed to take action beyond what we had already done, which was iteratively giving feedback on the horizontal prompt instructions.
Action:
Modifying the prompt instructions for Alexa+ wasn't having enough of an impact on the model's output to address the robotic tone of the generated interactions. To improve the responses, we first had to define what "robotic" meant: why both leadership and customers gave this feedback, and what it translated to in terms of language and patterns. A deep dive into the defects surfaced four top issues: repetitiveness, verbosity (too much talking), overly apologetic responses, and over-facilitation (asking too many follow-up questions).
Once we had identified the patterns and established metrics to evaluate them, we needed a way to improve these defects beyond simple prompt engineering. This was a complex process, but in short it required reviewing model output and building paired datasets of good and bad examples that could be used for both model fine-tuning and model evaluation.
Working closely with my Principal CxD lead designer (a direct report) and the science team, I estimated the CxD bandwidth needed to rewrite generated sample dialogs and found it was far beyond what my CxD team could absorb. Reusing a process I had run in a previous year to hire 15 CxD contractors, I hired and onboarded 10 CxD contractors and trained them on this new task. The work required hundreds of sample dialogs to be written or rewritten and carefully reviewed against a calibrated "judge," which we iterated repeatedly until it met the required level of accuracy. The final delivery to the science team included the judge, along with a pairwise dataset of defective and non-defective responses that met their requirements for model training.
Results:
Over the course of several model releases, the combined team's effort drove significant reductions in defect rates across live traffic: a 20% decrease for repetition, a 31% decrease for verbosity, a 62% decrease for over-apologetic responses, and a 74% decrease for over-facilitation.