NLP promoting a proactive approach to Open Data

The situation

The volume of Freedom of Information (FOI) requests the Council receives is huge. In the 43 months from December 2019 to July 2023, Essex County Council (ECC) received more than 6,000 FOI requests. This meant that on average, ECC processed six FOIs per day.

In the same period ECC, with its Open Data Steering Group at the helm, has made a concerted effort to share new information and data, believing that open data can provide greater transparency, accountability and collaboration between public sector organisations, citizens and community groups.

As Analysts and open data enthusiasts, we wanted to gain a better understanding of the demand for information – not so much in terms of volume, but rather in terms of content (i.e., what is it that people are asking for? And who is asking us for what?).

We felt that answering these questions would help us to achieve two things:

  1. Inform our open data content so that it includes the things people commonly request from us, making us more open by default,
  2. By making this information and data publicly accessible, increase transparency and reduce the resources ECC spends processing repetitive information requests.

The what?

To achieve this, we used Natural Language Processing (NLP) to streamline the analysis of FOI requests and enable us to identify frequently requested information. Using NLP, we identified significant words and phrases occurring in the unstructured, free text of FOI requests. This analysis was done by service area (e.g., Roads and Travel, Health and Social Care). Then, to extract key words and phrases from the text, we created network graphs of pairs of words (bigrams), triplets of words (trigrams), and quadruplets of words (quadgrams or four-grams).

Figure 1: An example of a bigram Network for the Roads & Travel service area

These network graphs highlighted a variety of topics that are frequently enquired about, some of which are already available, or will be made available, on our Open Data platform. For instance, under Roads and Travel, potholes and road maintenance were frequent topics of enquiry, while FOIs directed to Children’s Social Care pertained to aggregate figures of children in care, including demographics, care leavers and separated-migrant children.
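The bigram step described in this section can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used for the analysis: the stop-word list, the sample requests and the function names (`tokenize`, `ngrams`, `ngram_counts`) are all assumptions made for the example. Stop words are removed before the n-grams are formed, which is one of several reasonable choices.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "for", "to", "and",
              "how", "many", "were", "was", "is", "are", "please", "what"}


def tokenize(text):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def ngrams(tokens, n):
    """Return the consecutive n-word tuples in a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))


def ngram_counts(requests, n):
    """Count n-grams across a collection of request texts."""
    counts = Counter()
    for text in requests:
        counts.update(ngrams(tokenize(text), n))
    return counts


# Made-up example requests, not real FOI data.
sample_requests = [
    "How many potholes were reported in 2022?",
    "Please provide the road maintenance budget and potholes reported.",
    "Road maintenance spend on potholes reported last year",
]

bigram_counts = ngram_counts(sample_requests, 2)

# Each repeated bigram becomes a weighted edge (word_a, word_b, count)
# that a network graph could draw; here we keep pairs seen at least twice.
edges = [(a, b, c) for (a, b), c in bigram_counts.most_common() if c >= 2]
print(edges)
```

Calling `ngram_counts(sample_requests, 3)` or `ngram_counts(sample_requests, 4)` gives the trigram and four-gram counts in the same way. Running the function separately on the requests for each service area would mirror the per-service-area breakdown.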

The who?

A secondary avenue of research sought to understand the “who”: who typically requests information, and what can we learn about the information they request?

Using the email addresses provided at the time of enquiry, we were able to identify nine distinct profiles of enquirers. As can be seen from the Sankey network diagram below, members of the public make up the largest group of enquirers. A great many requests also come from commercial requesters, journalists, public servants, academics, and charity or campaign groups.

Figure 2: A Sankey network diagram demonstrating the flow of FOI requests from Profiles to Service Areas

We then repeated the NLP analysis, extracting key phrases into bigrams, trigrams and four-grams, this time using the newly identified profiles as a covariate. From the outputs, we observed that academic enquirers tend to ask for datasets with a social or environmental interest. Meanwhile, charity and campaign groups requested information and data about issues affecting specific causes or interest groups, such as veterans, animal welfare and the environment.
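As a rough illustration of the profiling step, the sketch below assigns a profile from the domain of an email address and then counts requests per profile and service area. Those counts are the link weights a Sankey diagram would draw. The domain rules, the profile labels, the `profile_from_email` function and the sample records are all hypothetical; they are not the nine profiles or the rules used in the analysis, which are not described here.

```python
from collections import Counter

# Hypothetical domain rules, for illustration only.
DOMAIN_SUFFIX_PROFILES = [
    (".ac.uk", "Academic"),
    (".gov.uk", "Public servant"),
    (".nhs.uk", "Public servant"),
]
PERSONAL_DOMAINS = {"gmail.com", "outlook.com", "yahoo.co.uk"}


def profile_from_email(email):
    """Guess an enquirer profile from the domain of an email address."""
    domain = email.rsplit("@", 1)[-1].lower()
    for suffix, profile in DOMAIN_SUFFIX_PROFILES:
        if domain.endswith(suffix):
            return profile
    if domain in PERSONAL_DOMAINS:
        return "Member of the public"
    # Anything the rules do not cover is left for manual review.
    return "Unclassified"


# Made-up (email, service area) records, not real FOI data.
requests = [
    ("a.person@gmail.com", "Roads and Travel"),
    ("researcher@example.ac.uk", "Health and Social Care"),
    ("officer@example.gov.uk", "Roads and Travel"),
    ("b.person@gmail.com", "Roads and Travel"),
]

# Count requests for each (profile, service area) pair.
flows = Counter((profile_from_email(email), area) for email, area in requests)

for (profile, area), count in flows.most_common():
    print(f"{profile} -> {area}: {count}")
```

Grouping the request texts by the assigned profile before counting n-grams would give the per-profile phrase lists described above.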

The impact

The work undertaken will inform the Essex Open Data Steering Group’s publication strategy for 2024 and help ensure the group takes a sufficiently active approach to managing the procurement and release of datasets. Next steps involve developing a content plan to proactively publish frequently requested information and data on Essex’s dedicated Open Data website, implementing effective signposting across the whole website, and engaging stakeholders across the county to champion open data.
