What we do is systematically harvest public records: metadata describing the flow of letters, reports and other documents to and from various public offices.
In 2016 we changed most of the technology and our user interface because we wanted to provide a combined search for all the archives in Oslo municipality, the capital of Norway.
We now have more than 7.5 million document references in our database. We keep track of which of the 98 archives in our application “owns” each document. Add-to-basket is enabled, and when the basket is full, the user can send a FOIA request with one click.
The result is an application that makes it easier to work with written sources in investigative journalism. It is also open to other groups, free of charge.
The Oslo records alone combine 45 different archives and account for more than 4 million of our document references.
This must be explained in two parts: the idea behind Innsyn.no itself, and the idea behind the Oslo records as a separate project.
We struggled with the municipality in our hometown, Kristiansand, which used to provide searchable electronic records that we in the press used on an everyday basis. But Kristiansand removed its database from the web and replaced it with image-only PDF files, which had to be read manually. Our newspaper had lost its most powerful tool for local investigative journalism, and decided to do something about it.
The first version of our tool contained Kristiansand. Later, we added the police districts, the universities, the armed forces and the church.
Then our newspaper had a dialogue with Oslo Redaktørforening, the association of editors of the media houses in Oslo. They asked if we could solve the problem of access to public documents in the capital. The problem was that Oslo had no common archive plan and more than 45 autonomous archives. They paid, and we solved the problem.
As with all our data, the public archives themselves are the sources in our Oslo project.
Historical data going six years back was acquired through FOIA requests by email.
Data from the three months before launch, and the weekly updates since launch, are collected by web scraping.
For other sources, like the police and armed forces, we subscribe to these records by email, and use an Outlook macro to grab the attachments and send the data to the same Dropbox App folder where we place the scraped files from Oslo.
While doing this, we rename every file to reflect the source of the document by adding a letter code to the file name. From Dropbox, the files are uploaded to DocumentCloud for OCR. Then, on a daily basis, we run a script that imports the data from DocumentCloud into our Postgres database, where it is structured by a parser assigned to each archive source. At this point, the journal entries are searchable with the full-text search capabilities in Postgres, and searches can be made across all sources.
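The letter code in the file name is what lets a later step pick the right parser for each source. A minimal sketch of that idea in Python follows. The letter codes "A" and "B", the two toy parsers and the function names are hypothetical and only illustrate the dispatch; they are not the actual Innsyn.no code.

```python
from pathlib import PurePosixPath
from typing import Callable, Dict, List


def parse_source_a(text: str) -> List[dict]:
    """Toy parser (hypothetical): one journal entry per non-empty line."""
    return [{"source": "A", "title": line.strip()}
            for line in text.splitlines() if line.strip()]


def parse_source_b(text: str) -> List[dict]:
    """Toy parser (hypothetical): entries are separated by blank lines."""
    blocks = [b for b in text.split("\n\n") if b.strip()]
    return [{"source": "B", "title": " ".join(b.split())} for b in blocks]


# Hypothetical registry: letter code -> parser for that archive source.
PARSERS: Dict[str, Callable[[str], List[dict]]] = {
    "A": parse_source_a,
    "B": parse_source_b,
}


def tag_filename(filename: str, code: str) -> str:
    """Prefix the bare file name with the letter code of its source."""
    return f"{code}_{PurePosixPath(filename).name}"


def parse_document(tagged_name: str, text: str) -> List[dict]:
    """Read the letter code back from the file name and run its parser."""
    code = tagged_name.split("_", 1)[0]
    if code not in PARSERS:
        raise ValueError(f"no parser registered for source code {code!r}")
    return PARSERS[code](text)
```

With this layout, adding a new archive means writing one parser and registering it under a new letter code; the import script itself does not change.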
Four people were involved in the project with the Oslo records. Tarjei Leer-Salvesen was in charge, and gathered the historic data from May to August 2016. Kenneth Lykkås did most of the technical work, assisted by Atle Brandt. Frode Nordby designed the new user interface.
Oslo Redaktørforening paid NOK 50,000 for the effort. We invested approximately the same amount ourselves, to bring the technological improvements we developed for the Oslo records to the rest of our application.
Innsyn.no has made Oslo municipality more transparent. The mayor of Oslo thanked us for the project: http://journalisten.no/2016/10/oslos-ordf%C3%B8rer-trenger-at-flere-ser-politikerne-i-korta
More importantly, our journalist colleagues in Oslo have gotten the tool they need to do a better job with written sources. After a few months, I hear that some of my colleagues there have filed a great number of FOIA requests through the app. I know it is now also being used by many people outside the journalism profession.
Some archives did not want to share data at all. We had to use the law as a tool, and file complaints in order to get the data we wanted to work with.
Having collected data from 45 different Oslo sources using six different types of archive software, we faced the problem that although all records in theory contain the same types of data, each archive needs its own parser for text extraction. The records are PDF documents that are converted to text using Tesseract OCR through DocumentCloud. The text representation varies a lot between the sources. Additionally, there are many edge cases in text placement within documents from the same source. To remedy this, we have used several patterns for text extraction, including parsing documents top-down, by pages and by journal entries. Our aim is to structure the data with the best possible quality, although some documents (for example, when a PDF is a scan of a printout) might not be handled 100% correctly yet.
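To make the "by pages and by journal entries" patterns concrete, here is a small Python sketch. It rests on assumptions that are not taken from our actual parsers: that pages in the OCR text are separated by a form feed character, and that each journal entry starts on a line with a case number of the form 2016/101-1 followed by a title. Real sources differ, which is exactly why each one needs its own rules.

```python
import re
from typing import Dict, List

# Assumption: the OCR text separates pages with a form feed character.
PAGE_BREAK = "\f"

# Assumption: an entry starts with a case number such as "2016/101-1"
# at the beginning of a line, followed by the entry title.
ENTRY_START = re.compile(
    r"^(?P<case>\d{4}/\d+-\d+)\s+(?P<title>.+)$", re.MULTILINE
)


def split_pages(text: str) -> List[str]:
    """Split OCR text into pages, dropping empty ones."""
    return [page for page in text.split(PAGE_BREAK) if page.strip()]


def split_entries(page: str) -> List[Dict[str, str]]:
    """Cut one page into journal entries at each entry-start line."""
    matches = list(ENTRY_START.finditer(page))
    entries = []
    for i, match in enumerate(matches):
        # An entry's body runs until the next entry starts (or the page ends).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(page)
        entries.append({
            "case": match.group("case"),
            "title": match.group("title").strip(),
            "body": page[match.end():end].strip(),
        })
    return entries


def parse_text(text: str) -> List[Dict[str, str]]:
    """Parse page by page, then entry by entry within each page."""
    return [entry for page in split_pages(text) for entry in split_entries(page)]
```

Parsing page by page keeps one bad page from spoiling the rest of the document, but an entry that continues across a page break would be cut short. That is one of the edge cases a source-specific parser has to handle.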