Nodabase.net

Innsyn.no – the Oslo records

Description

We systematically harvest public records, that is, metadata describing the flow of letters, reports and other documents to and from public offices.

In 2016 we replaced most of the technology and the user interface because we wanted to provide a combined search across all the archives in Oslo municipality, the capital of Norway.

We now have more than 7.5 million document references in our database. We keep track of which of the 98 archives in our application “owns” each document. Add-to-basket is enabled, and when the basket is full, the user can send a FOIA request with one click.
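A minimal sketch of how a basket could be turned into requests, assuming that a request goes to the archive that owns each document. The `DocumentRef` fields, the function names and the request wording are hypothetical and are not taken from the Innsyn.no code base.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DocumentRef:
    """One document reference in the basket (hypothetical fields)."""
    archive: str       # the archive that "owns" the document
    case_number: str
    title: str


def group_basket(basket: List[DocumentRef]) -> Dict[str, List[DocumentRef]]:
    """Group basket items by owning archive, keeping the order they were added."""
    grouped: Dict[str, List[DocumentRef]] = defaultdict(list)
    for ref in basket:
        grouped[ref.archive].append(ref)
    return dict(grouped)


def draft_requests(basket: List[DocumentRef]) -> Dict[str, str]:
    """Build one request text per archive (the wording is illustrative only)."""
    requests = {}
    for archive, refs in group_basket(basket).items():
        lines = [f"- {ref.case_number} {ref.title}" for ref in refs]
        requests[archive] = "Request for access to:\n" + "\n".join(lines)
    return requests
```

Grouping by owner first means that each archive only sees the documents it is responsible for, however many archives are mixed in one basket.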

The result is an application that makes it easier to work with written sources in investigative journalism. It is also open to other groups, free of charge.

The Oslo records alone combine 45 different archives and more than 4 million of our document references.

Origin

The origin has two parts: the idea behind Innsyn.no itself, and the idea of the Oslo records as a separate project.

We struggled with the municipality in our hometown, Kristiansand, which used to provide searchable electronic records that we in the press used every day. But Kristiansand removed its database from the web and replaced it with image-only PDF files that had to be read manually. Our newspaper had lost its most powerful tool for local investigative journalism, and decided to do something about it.

The first version of our tool covered Kristiansand. Later, we added the police districts, the universities, the armed forces and the church.

Then our newspaper had a dialogue with Oslo Redaktørforening, the association of editors of the media houses in Oslo. They asked if we could solve the problem of access to public documents in the capital. The problem was that Oslo had no common archive plan and more than 45 autonomous archives. They paid, and we solved the problem.

Data

As with all our data, the public archives themselves are the sources in our Oslo project.
Historic data going six years back was acquired through FOIA requests by email.
Data from the three months before launch, and the weekly updates since launch, are collected by web scraping.

For other sources, like the police and armed forces, we subscribe to these records by email, and use an Outlook macro to grab the attachments and send the data to the same Dropbox App folder where we place the scraped files from Oslo.

In the process, every file is renamed to reflect the source of the document by adding a letter code to the file name. From Dropbox, the files are uploaded to DocumentCloud for OCR. Then, once a day, we run a script that imports the data from DocumentCloud into our Postgres database, where it is structured with a parser assigned to each archive source. At this point, the journal entries are searchable with the full-text search capabilities in Postgres, and searches can be made across all sources.
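The step from a letter code in the file name to a parser for that source can be sketched as follows. This is only an illustration under stated assumptions: the archive names, the letter codes, the underscore separator and the two toy parsers are all hypothetical, since the text does not specify the real ones.

```python
from typing import Callable, Dict, List

# Hypothetical letter codes; the real codes used by Innsyn.no are not given.
SOURCE_CODES: Dict[str, str] = {
    "Bydel Frogner": "OBF",
    "Agder politidistrikt": "PAG",
}


def tag_filename(filename: str, source: str) -> str:
    """Prefix a file name with the letter code of its source archive."""
    return f"{SOURCE_CODES[source]}_{filename}"


def split_on_blank_lines(text: str) -> List[str]:
    """Toy parser: treat blocks separated by blank lines as entries."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def split_on_lines(text: str) -> List[str]:
    """Toy parser: treat every non-empty line as an entry."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# One parser per letter code (assumed mapping, for illustration only).
PARSERS: Dict[str, Callable[[str], List[str]]] = {
    "OBF": split_on_blank_lines,
    "PAG": split_on_lines,
}


def parse_document(tagged_filename: str, ocr_text: str) -> List[str]:
    """Pick the parser from the letter code at the start of the file name."""
    code = tagged_filename.split("_", 1)[0]
    return PARSERS[code](ocr_text)
```

Carrying the source in the file name means the import script needs no extra lookup: the name alone decides which parser handles the OCR text.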

Resources

Four people were involved in the project with the Oslo records. Tarjei Leer-Salvesen was in charge, and gathered the historic data from May to August 2016. Kenneth Lykkås did most of the technical work, assisted by Atle Brandt. Frode Nordby designed the new user interface.

Oslo Redaktørforening paid NOK 50,000 for the effort. We invested approximately the same amount ourselves, to bring the technological improvements we developed for the Oslo records to the rest of our application.

Impact

Innsyn.no has made Oslo municipality more transparent. The mayor of Oslo thanked us for the project: http://journalisten.no/2016/10/oslos-ordf%C3%B8rer-trenger-at-flere-ser-politikerne-i-korta

More importantly, our journalist colleagues in Oslo have got the tool they need to do a better job with written sources. After a few months, I hear that some of my colleagues there have filed a great number of FOIA requests through the app. I know it is now also being used by many people outside the journalism profession.

Obstacles

Some archives did not want to share data at all. We had to use the law as a tool, and file complaints in order to get the data we wanted to work with.

Having collected data from 45 different Oslo sources, using six different types of archive software, we faced the problem that although all records in theory contain the same types of data, every archive needs its own parser for text extraction. The records are PDF documents that are converted to text using Tesseract OCR through DocumentCloud. The text representation varies a lot between the sources. Additionally, there are many edge cases in text placement within documents from the same source. To remedy this, we use several patterns for text extraction, including parsing documents top-down, by pages and by journal entries. Our aim is to structure the data with the best possible quality, although some documents (for example, a PDF that is a scan of a printout) might not yet be handled 100% correctly.
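Parsing by journal entries can be illustrated with a short sketch. It assumes a layout in which every entry starts with a case number such as "16/01234-5" on its own line, followed by labelled fields. Both the case-number pattern and the field labels are hypothetical, and each real archive would need its own variant.

```python
import re
from typing import Dict, List

# Assumed layout: an entry starts with a case number at the beginning of a line.
ENTRY_START = re.compile(r"^\d{2,4}/\d+-\d+", re.MULTILINE)
# Assumed field labels; real records use source-specific labels.
FIELD = re.compile(r"^(Date|From|To|Title):[ \t]*(.+)$", re.MULTILINE)


def split_entries(ocr_text: str) -> List[str]:
    """Cut the OCR text into entries, discarding anything before the first one."""
    starts = [m.start() for m in ENTRY_START.finditer(ocr_text)]
    bounds = starts + [len(ocr_text)]
    return [ocr_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def parse_entry(entry: str) -> Dict[str, str]:
    """Extract the case number and any labelled fields from one entry."""
    record = {"case_number": ENTRY_START.match(entry).group(0)}
    for m in FIELD.finditer(entry):
        record[m.group(1).lower()] = m.group(2).strip()
    return record
```

Splitting first and extracting fields second keeps a misplaced line from one entry from leaking into the next, which helps with the edge cases in text placement mentioned above.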

