Location: New York
Want to work on a state-of-the-art open-source visualization system, to allow journalists and curious people everywhere to make sense of enormous document dumps, leaked or otherwise?
The project is called Overview. You can read about it at overview.ap.org. It’s going to be a system for the exploration of large to very large collections of unstructured text documents. We’re building it in New York in the main newsroom of The Associated Press, because we need to be working closely with real users. The AP has to deal with document dumps constantly. They download them from government sites. They file over 1,000 freedom of information requests each year. They look at every single leak from Wikileaks, Anonymous, Lulzsec. They’re drowning in this stuff. They need better tools. So does everyone else.
So we’re going to make the killer app for document set analysis. Overview will start with a visual programming language for computational linguistics algorithms. Like Max/MSP for text. The output of that will be connected to some large-scale visualization. All of this will be backed by a distributed file store and computed through map-reduce. Our target document set size is 10 million. The goal is to design a sort of visualization sketching system for large unstructured text document sets. Kinda like Processing, maybe, but data-flow instead of procedural.
We’ve already got a prototype working, which we pointed at the Wikileaks Iraq and Afghanistan data sets and learned some interesting things. Now we have to engineer an industrial-strength open-source product. It’s a challenging project, because it requires production implementation of state-of-the-art, research-level algorithms for distributed computing, statistical natural language processing, and high-throughput visualization. And, oh yeah, a web interface. So people can use it anywhere, to understand their world.
Because that’s what this is about: a step in the direction of applied transparency. Journalists badly need this tool. But everyone else needs it too. Transparency is not an end in itself; it’s what you can do with the data that counts. And right now, we suck at making sense of piles of documents. Have you ever looked at what comes back from a FOIA request, or what you can download from data.gov? It’s not pretty. Governments have to give you the documents, but they don’t have to organize them. What you typically get is a 10,000-page PDF. Emails mixed in with meeting minutes and financial statements and god knows what else. It’s like being let into a decrepit warehouse with paper stacked floor to ceiling. No boxes. No files. Good luck.
Intelligence agencies have the necessary technology, but you can’t have it. The legal profession has some pretty good “e-discovery” software, but it’s wildly expensive. Law enforcement won’t share either. There are a few cheapish commercial products but they all choke above 10,000 documents because they’re not written with scalable, distributed algorithms. (Ask me how I know.) There simply isn’t an open, extensible tool for making sense of huge quantities of unstructured text. Not searching it, but finding the patterns you didn’t know you were looking for. The big picture. The Overview.
So we’re making one. We need developers, especially for the front end: visualization design, and a sweet user interface. Here are the buzzwords we are looking for in potential hires:
- Be a genuine computer scientist, or at least be able to act like one. There will be real algorithms here.
- But it’s not just research. We have to ship production software. So be someone who has done that, on a big project.
- This stuff is complicated! The UX has to make it simple for the user. Design, design, design!
- We’re open-source. Are you good at leading a distributed development community?
We are funded and hiring immediately. It’s a two-year contract to start. We’ve got a desk with a really nice view of the Hudson River.
For more information, see :
To apply: Send resume to email@example.com.