If you are one of the 40 million people who enjoy reading or writing the mostly romantic werewolf, superhero or historical fiction stories found on Canadian startup Wattpad, you may also be contributing to the development of the next generation of artificial intelligence.
In a new paper called Augur: Mining Human Behaviors from Fiction to Power Interactive Systems, a group of Stanford University computer science researchers revealed that they used the Wattpad "corpus" – a collection of almost two billion words (or 600,000 chapters) written by regular people – to help a computer understand the world around it. The team intends to make the program they built, Augur, into an open-source tool that other researchers can build on.
"The basic idea is that it's very difficult to program computers to understand the broad range of things that people do," says fourth-year PhD student Ethan Fast, co-author of the paper (published as part of the upcoming Computer Human Interaction conference) and a member of Stanford's Human-Computer Interaction Group. "Fiction has a lot of useful things to say about the world, and if you have enough of it, you can model it in much more depth than you could hope to manually."
Until recently, Toronto-based Wattpad, founded in 2006, didn't make its data available to researchers, and it might not have done so in this case without the intervention of co-founder Ivan Yuen, who knows members of the Stanford team. More than 200 million uploads (some stories, some just chapters) have been shared on Wattpad. The majority of its users are under 30, and they spend 13 billion minutes a month on the service. So far, the company, which has 112 employees, has raised more than $66-million (U.S.) in venture capital financing.
"When we started this in 2014, we knew there was value in the corpus, but we hadn't really explored it too much," Wattpad's head of engineering, Jordan Christensen, says. "As we started working with the Stanford guys, it really opened our eyes a bit and now … through our own internal research and with partners, we are really starting to change the way we think about Wattpad."
Using the Wattpad data, the Stanford team developed a model with "54,075 human activities [related to] 13,843 objects and locations. For example 'unfold letter' occurs only 203 times in our dataset, yet Augur connects it to 1,072 different objects (e.g., handwriting, envelope). A more frequent activity like 'take picture' occurs 10,249 times, and is connected with 5,250 objects (e.g., camera, Instagram)."
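One way to picture the kind of activity-to-object association quoted above is as a table of co-occurrence counts. The following is a minimal Python sketch under stated assumptions, not the Stanford team's code: the `mentions` list and the `build_associations` helper are hypothetical, and the pairs are hand-written stand-ins rather than data drawn from the Wattpad corpus.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-written (activity, object) mentions standing in for
# pairs extracted from fiction. They are NOT taken from the Wattpad corpus.
mentions = [
    ("unfold letter", "envelope"),
    ("unfold letter", "handwriting"),
    ("take picture", "camera"),
    ("take picture", "camera"),
    ("take picture", "Instagram"),
]


def build_associations(pairs):
    """Count how often each object is mentioned alongside each activity."""
    table = defaultdict(Counter)
    for activity, obj in pairs:
        table[activity][obj] += 1
    return table


associations = build_associations(mentions)

# Objects linked to an activity, most frequently mentioned first.
print(associations["take picture"].most_common())
```

In this toy table, an activity that appears rarely can still be linked to several distinct objects, which is the pattern the researchers describe for "unfold letter."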
To test the true-to-life nature of the Wattpad stories, the research team combined two existing technologies: Computer vision models that identified objects seen by a camera (via a Google Glass headset) and described them in text, and natural language processing that "read" all the Wattpad fiction to create the database of references for human activities. It turns out that the youth-friendly fiction on Wattpad (more than 24 hours' worth of new reading material is uploaded every minute) tends to be exceptionally good at describing the modern world in ways a computer can interpret.
Augur then attempts to predict what is happening by comparing objects with expected behaviours.
"People do things like work. When does work happen? When you're around a laptop or when you're around a computer monitor," says Mr. Fast, adding that the Augur engine was able to figure out the context of a given situation with 71-per-cent precision. The team would have been happy with 50 per cent.
Mr. Fast and his team also explored some of the software tools contextual awareness makes possible, such as "a phone that silences itself when the odds of you answering it are low, and a dynamic music player that adjusts to your present activity."
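The comparison of detected objects with expected behaviours can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Augur's actual scoring method: the `COUNTS` table and the `rank_activities` function are hypothetical, and the numbers are invented for the example. Each activity is scored by the share of its object mentions that match what the camera currently sees.

```python
# Hypothetical co-occurrence counts (activity -> object -> mentions).
# The numbers are invented for illustration and are not from the paper.
COUNTS = {
    "work": {"laptop": 8, "computer monitor": 6, "phone": 2},
    "take picture": {"camera": 9, "phone": 4},
    "unfold letter": {"envelope": 5, "handwriting": 3},
}


def rank_activities(detected_objects, counts=COUNTS):
    """Rank activities by the fraction of their mentions that involve
    the objects currently detected (a simple stand-in scoring rule)."""
    scores = {}
    for activity, obj_counts in counts.items():
        total = sum(obj_counts.values())
        matched = sum(obj_counts.get(obj, 0) for obj in detected_objects)
        scores[activity] = matched / total
    # Highest-scoring activity first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_activities(["laptop", "computer monitor"])
print(ranking[0])
```

With a laptop and a computer monitor in view, "work" ranks first in this toy table, mirroring Mr. Fast's example. An application such as the self-silencing phone could, in principle, act on a ranking like this one.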
The process of "training" an artificially intelligent system to think somewhat like a human is an attempt to create a list of experiences, but as it turns out, even very large samples of source material come with biases. In the case of Wattpad's corpus, the main issue is one its teen readers would understand: Drama.
"One obvious bias of fiction in general is that things are happening that are designed to create drama, which doesn't necessarily mean that's something that should happen," Mr. Fast says. For instance, when Augur noted a phone was present while the tester was cursing, one of the conclusions the engine drew was that a phone was about to get thrown across the room. Like punching someone in the face at the slightest provocation, that is the kind of thing a teen might write about, but it happens a lot less in the real world than Wattpad stories might suggest.
But this kind of flaw isn't unique to the Wattpad data. For instance, the Stanford team has also assessed a Google Books corpus that contains a huge variety of literary and classic fiction, which introduces a whole other sort of bias. "If you have a larger weight of pre-modern fiction, you get a lot of stuff that makes less sense," Mr. Fast says. "It's less focused on the modern world, it doesn't know what a cellphone is, doesn't know what Facebook is. In some sense, the bias of super old stuff is worse than the dramatic bias of poorly crafted fan fiction. And these amateur writers are actually more focused on mundane details that we would find useful than the great works of literature."
Wattpad is now working with researchers from the University of Toronto and McGill on new natural language processing (NLP) programs that might be able to do things such as create unbiased summaries of a story's content. The platform has 22 official categories or genres, but by using NLP systems, Mr. Christensen thinks the company could automate the creation of new categories. For example, Wattpad has recently identified two trending themes (dangerous love fiction about creepy emotional entanglements, and urban fiction with stories set in criminal undergrounds such as gangs) but an automated process might be able to see those trends long before humans would find the time to read all the new stories.
But as those two new genres suggest, Wattpad fiction's other bias is toward the racy interpretation of any situation. Mr. Fast says the Stanford team weeded out the most sex-drenched human "activities" contained in the corpus because that's not what they were interested in.
"There were things we did not want to emphasize," Mr. Fast says.