Explainability
Overview
The SQuARE platform provides behavioral testing via CheckList. The tests are unit tests designed by end users or system experts. The two most common test types are the Minimum Functionality Test (MFT) and the INVariance test (INV), illustrated in the table below.
Minimum Functionality Test (MFT): Taxonomy | INVariance (INV): Robustness |
---|---|
C: There is a tiny purple box in the room. | C: ...Newcomen designs had a duty of about 7 million, but most were closer to 5 million.... |
Q: What size is the box? | Q: What was the ideal duty->>udty of a Newcomen engine? |
Test: Check if the prediction is tiny. | Test: Check whether the prediction changes or not. |
MFTs measure a capability (e.g., the Taxonomy capability of matching object properties to categories) by specifying the expected behaviour (e.g., “tiny” in the table above). INV tests likewise target a capability (e.g., robustness to spelling errors in the question), but the expected behaviour is known in advance: the answer should remain the same when the input is perturbed.
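The difference between the two test types can be sketched in a few lines of plain Python. This is a minimal illustration, not the CheckList API or SQuARE's implementation: `toy_skill`, `run_mft`, and `run_inv` are hypothetical names, and the toy Skill is a keyword matcher standing in for a real model.

```python
from typing import Callable

Predict = Callable[[str, str], str]


def toy_skill(context: str, question: str) -> str:
    """Hypothetical stand-in for a QA Skill: returns a size word from the context."""
    sizes = ("tiny", "small", "large", "huge")
    if "size" in question.lower():
        for word in context.lower().split():
            if word in sizes:
                return word
    return ""


def run_mft(predict: Predict, context: str, question: str, expected: str) -> bool:
    """MFT: passes if the prediction equals the expected answer."""
    return predict(context, question).strip().lower() == expected.lower()


def run_inv(predict: Predict, context: str, question: str, perturbed: str) -> bool:
    """INV: passes if the prediction is unchanged under the perturbation."""
    return predict(context, question) == predict(context, perturbed)


context = "There is a tiny purple box in the room."
question = "What size is the box?"

# MFT: the expected answer is stated explicitly.
mft_passed = run_mft(toy_skill, context, question, "tiny")

# INV: only the perturbation is stated; the answer must simply not change.
inv_passed = run_inv(toy_skill, context, question, "What szie is the box?")
```

Here the toy Skill passes the MFT but fails the INV test, because the typo in the question changes its prediction. This is the kind of robustness failure an INV test is meant to surface.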
Behavioral testing of skills
Users can choose the Skill they want to investigate from the drop-down menu. The “Show Checklist” button is activated once the predictions from the tests have been saved in a JSON file.
The different tests are displayed together with their success and failure rates for the selected Skill. An example visualization for testing the SQuAD Skill is given in the figure below.
To analyze or process a Skill’s test performance in more detail, a full JSON report of all test examples can be downloaded using the Download all examples button.
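As a rough sketch of such offline processing, the snippet below computes a failure rate per test from a report. The report layout used here (a list of cases with `test_name`, `test_type`, and `passed` fields) is an assumption made for illustration only, and the sample values are invented; the structure of the file downloaded from SQuARE may differ and should be inspected first.

```python
import json
from collections import defaultdict

# Hypothetical report layout and invented sample values, for illustration only.
sample_report = json.loads("""
[
  {"test_name": "size", "test_type": "MFT", "passed": true},
  {"test_name": "size", "test_type": "MFT", "passed": false},
  {"test_name": "typos", "test_type": "INV", "passed": true}
]
""")


def failure_rates(cases):
    """Return the fraction of failed cases per (test_type, test_name) pair."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for case in cases:
        key = (case["test_type"], case["test_name"])
        totals[key] += 1
        if not case["passed"]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


rates = failure_rates(sample_report)
for (test_type, test_name), rate in sorted(rates.items()):
    print(f"{test_type} {test_name}: {rate:.0%} failed")
```

For a downloaded report, the inline sample would be replaced by loading the file with `json.load`, after adapting the field names to the actual report.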
To view the failed test cases in more detail, the user can click the Expand button. This allows the user to quickly identify changes the Skill could not handle.
Currently supported skills
Name | Retrieval Model | Datastore | Reader Model | Reader Adapter | Type | Code |
---|---|---|---|---|---|---|
BoolQ BERT Adapter | | | bert-base-uncased | boolq | categorical | code |
BoolQ RoBERTa Adapter | | | roberta-base | boolq | categorical | code |
CommonsenseQA BERT Adapter | | | bert-base-uncased | commonsense_qa | multiple-choice | code |
CommonsenseQA RoBERTa Adapter | | | roberta-base | commonsense_qa | multiple-choice | code |
CosmosQA BERT | | | bert-base-uncased | cosmos_qa | multiple-choice | code |
CosmosQA RoBERTa Adapter | | | roberta-base | cosmos_qa | multiple-choice | code |
DROP BERT Adapter | | | bert-base-uncased | drop | span-extraction | code |
DROP RoBERTa Adapter | | | roberta-base | drop | span-extraction | code |
HotpotQA BERT Adapter | | | bert-base-uncased | hotpotqa | span-extraction | code |
HotpotQA RoBERTa Adapter | | | roberta-base | hotpotqa | span-extraction | code |
MultiRC BERT Adapter | | | bert-base-uncased | multirc | multiple-choice | code |
MultiRC RoBERTa Adapter | | | roberta-base | multirc | multiple-choice | code |
NewsQA BERT Adapter | | | bert-base-uncased | newsqa | span-extraction | code |
NewsQA RoBERTa Adapter | | | roberta-base | newsqa | span-extraction | code |
QuAIL BERT Adapter | | | bert-base-uncased | quail | multiple-choice | code |
QuAIL RoBERTa Adapter | | | roberta-base | quail | multiple-choice | code |
QuaRTz RoBERTa Adapter | | | roberta-base | quartz | multiple-choice | code |
Quoref BERT Adapter | | | bert-base-uncased | quoref | span-extraction | code |
Quoref RoBERTa Adapter | | | roberta-base | quoref | span-extraction | code |
RACE BERT Adapter | | | bert-base-uncased | race | multiple-choice | code |
RACE RoBERTa Adapter | | | roberta-base | race | multiple-choice | code |
SQuAD 1.1 BERT Adapter | | | bert-base-uncased | squad | span-extraction | code |
SQuAD 1.1 RoBERTa Adapter | | | roberta-base | squad | span-extraction | code |
SQuAD 2.0 BERT Adapter | | | bert-base-uncased | squad_v2 | span-extraction | code |
Social-IQA BERT Adapter | | | bert-base-uncased | social_i_qa | multiple-choice | code |
Social-IQA RoBERTa Adapter | | | roberta-base | social_i_qa | multiple-choice | code |
Check out these skills on the SQuARE platform.