When it comes to journalism, technology hasn’t exactly been a boon. The combination of free, Internet-based listings sites and advertising rates that are lower on the web than in print has left newsrooms around the globe withered to the bone. Algorithmic computer programs are now able to write news articles that readers are unable to differentiate from ones written by living, breathing human beings.
But, at least, real journalists and editors are needed to identify newsworthy topics and craft intriguing headlines. Computer programs can’t do that, right?
At the second annual workshop on Social News On the Web (SNOW) held earlier this month in Seoul, South Korea, nine teams of computer scientists from around the world competed to devise systems that could use Twitter data to automatically generate newsworthy topics, come up with headlines for potential stories on those topics, and then find pictures to accompany those stories.
The result: We should probably all start polishing our resumes.
According to the competition’s organizers, the goal isn’t to put real journalists out of a job. Rather, the plan is to create tools journalists could use to wade through Twitter’s ever-frothy sea of information to fish out important news stories.
Consider a scenario of news professionals who use social media to monitor the newsworthy stories that emerge from the crowd. The volume of information is very high and it is often difficult to extract such stories from a live social media stream. The task of this challenge is to automatically mine social streams to provide journalists a set of headlines and complementary information that summarize the most important topics for a number of time slots...of interest.
Nine teams participated in the competition, collecting over one million tweets.
The winning team consisted of a trio of researchers from the Insight Centre for Data Analytics at University College Dublin. Their system employed a multi-step process that began by identifying Twitter users who were the most likely to tweet about newsworthy events.
Once those users were identified, it collected their tweets, removed extraneous information like hashtags and @ mentions, and grouped similar tweets into clusters containing similar words that suddenly saw a spike in activity. Those clusters were then assigned scores based on characteristics like the casual or official nature of the language and their likelihood of mentioning specific named entities like Barack Obama.
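To make the cleaning and grouping steps concrete, here is a minimal Python sketch. It is not the Dublin team's code: the function names, the Jaccard word-overlap measure, the greedy single-pass grouping, and the 0.5 threshold are all assumptions chosen for illustration, and stripping URLs is an extra step the article does not mention. The sketch also leaves out the spike detection and the cluster scoring described above.

```python
import re

# Hypothetical helpers for illustration; the winning system's real code
# is not described in this article.
# Assumption: URLs are stripped along with hashtags and @ mentions.
NOISE = re.compile(r"(https?://\S+|[@#]\w+)")


def clean(text):
    """Strip URLs, @ mentions and hashtags, then collapse whitespace."""
    return " ".join(NOISE.sub(" ", text).split())


def tokens(text):
    """Return the set of lowercased words in the cleaned tweet."""
    return set(re.findall(r"[a-z0-9']+", clean(text).lower()))


def jaccard(a, b):
    """Share of words two tweets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(tweets, threshold=0.5):
    """Greedy grouping: a tweet joins the first cluster whose first
    tweet is similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of original tweet strings
    for tweet in tweets:
        words = tokens(tweet)
        for group in clusters:
            if jaccard(words, tokens(group[0])) >= threshold:
                group.append(tweet)
                break
        else:
            clusters.append([tweet])
    return clusters
```

Calling `cluster()` on a list of tweet strings returns a list of groups, each holding tweets that share at least half their words with the first tweet in that group. A production system would presumably use a more robust similarity measure and would only keep clusters whose activity spikes over time.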
After the system had identified clusters of tweets that it viewed as the most likely to be about a single newsworthy event, it then pulled the earliest tweet in the group as a proposed headline for the topic. Those tweets were then automatically cleaned up into a state where they read like real headlines. Pictures were identified by following URLs inserted in the tweets and then locating images on the accompanying web pages.
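The headline step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's implementation: the `propose_headline` name, the `(timestamp, text)` tuple format, and the specific cleanup rules (dropping a leading "RT", @ mentions, hashtags and URLs) are all hypothetical. The sketch only collects the URLs found in the tweet; fetching the linked pages and locating images on them is left out.

```python
import re
from datetime import datetime

URL = re.compile(r"https?://\S+")
# Assumed cleanup rules: a leading "RT", plus any @mention or #hashtag.
NOISE = re.compile(r"(^RT\s+|[@#]\w+:?)")


def propose_headline(cluster):
    """Pick the earliest tweet in a cluster and tidy it into a headline.

    cluster: non-empty list of (datetime, text) pairs (hypothetical format).
    Returns (headline, urls), where urls are the links found in that
    tweet, which a later step could follow to look for pictures.
    """
    _, text = min(cluster, key=lambda item: item[0])  # earliest tweet
    urls = URL.findall(text)
    headline = NOISE.sub(" ", URL.sub(" ", text))
    headline = " ".join(headline.split()).strip(" :-")
    return headline, urls
```

Keeping the URLs separate from the headline text mirrors the description above, where the tweet supplies the wording and the linked page supplies the image.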
The results were fairly impressive:
The new, full Godzilla trailer has roared online
Ukraine Currency Hits Record Low Amid Uncertainty
Ooh, my back! Why workers' aches pains are hurting the UK economy
Uganda: how campaigners are preparing to counter the anti-gay bill
Fans gather outside Ghostbusters firehouse in N.Y.C. to pay tribute to Harold Ramis
Man survives a shooting because the Bible in his top pocket stopped two bullets
Ukraine's toppling craze reaches even legendary Russian commander, who fought Napoleon
Other entrants into the competition, such as a team of researchers based in Belgium and the UK, used slightly different methods. For example, their headlines were generated by cleaning up what they saw as the most “representative” sentence from all the tweets in an individual cluster rather than always going off of the earliest tweet.
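The article does not say how that team defined “representative,” so the Python sketch below makes an assumption: it treats the most representative tweet as the one whose words overlap most, on average, with every other tweet in the cluster. The function names and the overlap measure are hypothetical.

```python
import re


def words(text):
    """Return the set of lowercased words in a tweet."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap(a, b):
    """Jaccard overlap between two word sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_representative(cluster):
    """Return the tweet with the highest mean word overlap with the rest.

    Assumption: 'representative' means closest to all other tweets in
    the cluster; the team's actual criterion is not given here.
    """
    if not cluster:
        raise ValueError("cluster must contain at least one tweet")
    if len(cluster) == 1:
        return cluster[0]
    bags = [words(t) for t in cluster]

    def mean_overlap(i):
        others = (overlap(bags[i], bags[j]) for j in range(len(bags)) if j != i)
        return sum(others) / (len(bags) - 1)

    best = max(range(len(cluster)), key=mean_overlap)
    return cluster[best]
```

Unlike picking the earliest tweet, this choice depends on the whole cluster, so a late but well-worded tweet can still become the headline candidate.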
While this system does automate some of the news gathering functions of human journalists, it still requires human journalists to do the actual, initial legwork. So, New York Times writers, you’re probably in the clear.
However, writers at viral aggregation sites whose business models largely consist of putting click-y headlines on other people’s work might want to start looking over their shoulders.
Photo by Daniel R. Blume/Wikimedia Commons