After IBM Watson won on Jeopardy!, GM Manoj Saxena exclaimed that it “rivaled a human’s ability to answer questions posed in natural language with speed, accuracy and confidence.” [1] And that was 8 years ago. At the very same time, when Time asked, “Why aren't you letting Watson speak for himself today?”, Watson’s chief scientist David Ferrucci answered, “Watson is trained to answer questions for Jeopardy! It's not an interactive dialogue system, so it can't conduct its own interviews.” [2] A bit incoherent, to say the least. Either “answer questions” is not “interactive dialogue”, or an interview is not “answer questions”?
Now that we have seemingly earned the right to claim “state-of-the-art” on SQuAD (and where is Watson?), we are telling you straight that we did not “rival humans” in anything: the test was too easy!
The much more interesting question is, “How should we decide whether any AI has rivaled humans in anything?”
There is an old Chinese saying, “To tell a horse from a mule, take it for a ride!” (是骡子是马拿来溜溜). Within the realm of NLU, we want to take this opportunity to propose such a ride, namely, the World Series of Language Games (www.wslg.org).
What should we be looking for in the world champion of WSLG? “U” for understanding, of course! The winner need not know all the correct answers or responses (no one knows everything!), but at a minimum it must maintain its coherence (being self-consistent and relevant) throughout. After that, it needs to beat everyone else in correct answers or responses.
In our opinion, the following are the necessary requirements of WSLG:
- Open: The games must be open to any system (including humans) and open to any domain or language;
- Fair: Performance in the games should be the only criterion for determining the winner;
- Efficient: The games must be efficient to play and score;
- Entertaining: The games should be continuous, dynamic, entertaining and open for spectators to watch or study; and
- Final Arbiter: Since humans are the only known species that possesses “understanding”, we must be the final arbiter for the “U” part.
Considering all the above, we have come up with the following set of rules for WSLG:
- Competitors will play against each other in one-on-one language games conducted as turn-by-turn conversations.
- The total number of rounds (turns) and the other specifics of each game are negotiated between the competitors. The negotiation is part of the game.
- Everyone starts with 0 points. A win is worth 1 point; any other result is worth 0 points.
- Competitors are stratified according to their total points, with games being played only between opponents from the same or neighboring levels.
- Technical Knockout: At any time, one player can call out the opponent as being incoherent, after which a human jury of 3 from the pool of participants (required) and other volunteers will determine the result:
  - By a unanimous vote, the called side is disqualified and loses the game for being incoherent.
  - Otherwise, the calling side is assessed 1 strike. After 3 strikes, it is out and loses the game.
- Score tally: If there is no TKO, then after the agreed number of rounds each side picks 10 answers from the opponent’s responses to submit to a human jury of 3. By a majority vote, each answer is evaluated on its coherence and correctness as follows: correct answer = +1, coherent but incorrect answer = 0, incoherent answer = -10. The side with the higher score wins, provided that its total score is positive.
- Final Verdict: At the end of each season, the non-human system with the most points is crowned world champion and presented with the Wittgenstein Cup, provided that its point total is positive. In addition, it gets a chance to prove to a human panel of 5 judges that it actually “understands” natural language.
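To make the scoring rules above concrete, here is a minimal Python sketch of the score tally and the Technical Knockout call. It is an illustration under our own assumptions, not an official WSLG implementation: the names `Verdict`, `majority`, `tally`, `winner` and `resolve_callout` are hypothetical, and because the rules do not say what happens when the three jurors each give a different verdict, the sketch simply raises an error in that case.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    """Per-answer verdicts, valued by the points stated in the rules."""
    CORRECT = 1        # correct answer: +1
    COHERENT = 0       # coherent but incorrect answer: 0
    INCOHERENT = -10   # incoherent answer: -10


def majority(votes):
    """Return the verdict backed by more than half of the jurors."""
    verdict, count = Counter(votes).most_common(1)[0]
    if count * 2 <= len(votes):
        # Assumption: the rules leave a three-way split undefined.
        raise ValueError("no majority verdict")
    return verdict


def tally(jury_votes):
    """Total score for one side; jury_votes holds one list of votes per answer."""
    return sum(majority(votes).value for votes in jury_votes)


def winner(score_a, score_b):
    """The higher score wins, but only if it is positive; otherwise no one wins."""
    if score_a > score_b and score_a > 0:
        return "A"
    if score_b > score_a and score_b > 0:
        return "B"
    return None


def resolve_callout(jury_says_incoherent, caller_strikes):
    """Resolve a TKO call. Returns (loser, caller_strikes) with loser in
    {"called", "caller", None}; None means the game continues."""
    if all(jury_says_incoherent):      # unanimous: the called side loses
        return "called", caller_strikes
    caller_strikes += 1                # otherwise the caller gets a strike
    if caller_strikes >= 3:            # three strikes: the caller loses
        return "caller", caller_strikes
    return None, caller_strikes
```

Note how heavily a single incoherent answer weighs: with a penalty of -10 and only 10 answers in play, one incoherent answer cannot be fully offset by the remaining nine, so that side's total cannot be positive and it cannot win the game.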
That is it. Bring it on and tell me what you got, AIs.