Related Resources

We provide related resources for end-to-end task-oriented dialogue (EToD) systems, including datasets and our resources repo.

Datasets

We list several commonly used EToD datasets below:

Modularly EToD datasets

  • MultiWOZ. MultiWOZ 2.0 and 2.1 are both used for evaluation across different papers. MultiWOZ is one of the most widely used ToD datasets. It contains over 8,000 dialogue sessions spanning 7 domains: restaurant, hotel, attraction, taxi, train, hospital, and police.

Fully EToD datasets

  • SMD. The Stanford Multi-turn Multi-domain Task-oriented Dialogue Dataset (SMD) includes three domains: navigation, weather, and calendar.

  • CamRest676. CamRest676 is a relatively small-scale restaurant domain dataset. It consists of 408/136/136 dialogues for training/validation/testing.

Other ToD dataset resources that might help EToD research

Multi-modal ToD Datasets

  • SIMMC. SIMMC 2.0 is a dataset for Situated and Interactive Multimodal Conversations. It includes 11K task-oriented user<->assistant dialogs (117K utterances) in the shopping domain, grounded in immersive and photo-realistic scenes.

  • MMConv. The Multimodal Multi-domain Conversational dataset (MMConv) is a fully annotated collection of human-to-human role-playing dialogues spanning multiple domains and tasks.

Survey of Datasets for EToD

Metrics and Evaluation Methods

We list some common metrics used for evaluating EToD systems:

Modularly EToD Metrics

  • BLEU is used to measure the fluency of the generated response by calculating n-gram overlaps between the generated response and the gold response.

  • Inform and Success. Inform measures whether the system provides an appropriate entity, and Success measures whether the system answers all requested attributes.

  • Combined is a comprehensive metric that considers BLEU, Inform, and Success. It is calculated as: Combined = (Inform + Success) × 0.5 + BLEU.
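The metrics above can be illustrated with a minimal sketch. It assumes Inform, Success, and BLEU are already available as numbers on the same 0-100 scale, and it shows only the clipped n-gram precision that BLEU builds on, not full BLEU (no brevity penalty, no averaging over n-gram orders). The function names `ngram_precision` and `combined_score` are hypothetical and are not taken from any official evaluation script.

```python
from collections import Counter
from typing import List


def ngram_precision(candidate: List[str], reference: List[str], n: int = 1) -> float:
    """Clipped n-gram precision of a candidate response against one gold response."""

    def ngrams(tokens: List[str]) -> Counter:
        # Count every contiguous n-token window in the sequence.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def combined_score(inform: float, success: float, bleu: float) -> float:
    """Combined = (Inform + Success) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu


if __name__ == "__main__":
    generated = "the hotel is in the north".split()
    gold = "the hotel is in the centre".split()
    print(ngram_precision(generated, gold, n=1))
    print(ngram_precision(generated, gold, n=2))
    print(combined_score(inform=80.0, success=70.0, bleu=18.0))
```

In practice, papers usually report BLEU from an established evaluation script, so this sketch is meant only to make the n-gram overlap idea and the Combined formula concrete.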

Fully EToD Metrics

  • BLEU is used to measure the fluency of the generated response by calculating n-gram overlaps between the generated response and the gold response.

  • Entity F1 is used to measure how well the entities in the system response match those in the gold response, computed as the F1 score of the micro-averaged precision and recall.
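As a rough illustration of micro-averaged Entity F1, the sketch below pools true positives, false positives, and false negatives over all turns before computing precision and recall. It assumes entities have already been extracted from each response as sets of strings and are compared by exact match. The function name `micro_entity_f1` and the example entities are hypothetical, and published evaluation scripts may handle entity extraction and normalization differently.

```python
from typing import Iterable, Set, Tuple


def micro_entity_f1(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Micro-averaged F1 over (predicted_entities, gold_entities) pairs, one per turn."""
    tp = fp = fn = 0
    for predicted, gold in pairs:
        tp += len(predicted & gold)   # entities that appear in both responses
        fp += len(predicted - gold)   # entities only in the system response
        fn += len(gold - predicted)   # gold entities the system missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    turns = [
        ({"pizza_hut", "5_miles"}, {"pizza_hut"}),
        ({"sunny"}, {"sunny", "monday"}),
    ]
    print(micro_entity_f1(turns))
```

Pooling the counts before dividing is what makes the average "micro": turns with more entities contribute more to the final score than turns with few.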