Daniel Soares
Irina Ionita
How to Speed-up Risk Data Collection
A collaboration between MapAction, Analytics for a Better World, and Pipple
The key step of any data and analytics project? Data, data, data, and did we mention collecting the right data? Easier typed than executed, right?
When data sources are in well-structured, actionable and, most importantly, accessible formats, this task is very simple. Otherwise… it can be as time-consuming as working with Internet Explorer in 2024. So how do you keep data collection a task that boosts your efficiency (and enthusiasm) for a data and analytics project?
MapAction, Analytics for a Better World, and Pipple joined forces in 2024 on an information extraction pilot project to find out exactly how that can be done, with the aim of speeding up data collection for risk projects.
How does MapAction take action?
MapAction is an international charity based in the United Kingdom specialising in Information Management for Disaster Response and Preparedness. Founded in 2002, it has since deployed to more than 130 emergency responses and 500 preparedness and capacity-building missions. It follows a mixed model with 32 staff members and 80 highly skilled GIS and data volunteers. In recent years, MapAction has developed new areas of work, including Disaster Risk Reduction, Anticipatory Action, Health, and Technology and Innovation. In recent months we have, for example, deployed to emergency responses in The Gambia, Belize, Grenada, and St Vincent and the Grenadines; developed risk and anticipatory action projects in Eswatini, Madagascar, Ecuador and Colombia; worked with child vaccination information in several countries in West and Central Africa; and supported data quality standards with OCHA.
Collaborations for Impact
Analytics for a Better World is a Netherlands-based non-profit organisation, set up as a joint effort between ORTEC and the University of Amsterdam. Their vision centres analytics as a powerful tool to reach the Sustainable Development Goals (SDGs). Analytics for a Better World brings together the combined strengths of nonprofits, academia and the business world around the theme of SDG-related analytics in several activities. To empower nonprofits, they educate their C-level executives, management, and specialists in how to use analytics to further their objectives. To deepen their support for nonprofits in creating impact with analytics, they build analytical roadmaps and jointly deploy analytics projects. Some of their most inspiring collaborations have been with The Ocean Cleanup, the 510 Dutch Red Cross Initiative, and their annual fellowship bringing together NGO workers and data specialists from all over the world. In addition, they conduct and stimulate applied research on analytics aimed at contributing to the SDGs. Because accessibility of knowledge has an impactful role, they share everything they create as open source through their repository.
Pipple is a data & AI agency based in Eindhoven, the Netherlands. They are a team of creative mathematicians and engineers, specialised in solving complex issues through data & AI. Pipple was founded in 2016 and has since delivered more than 200 successful solutions to their customers. They have also had ongoing collaborations with various nonprofit organisations such as the Red Cross, The Ocean Cleanup and, more recently, Analytics for a Better World.
Our goal in working with the INFORM Subnational Risk Index
Firstly, what exactly is the INFORM Subnational Risk Index?
An INFORM Subnational Risk Index shows a detailed picture of risk and its components within a single region or country. It covers not only hazard exposure (e.g. earthquakes, floods and conflicts) but also a country’s vulnerabilities, such as disease prevalence and poverty, as well as its coping capacity.
Since July 2023, in partnership with the European Commission’s INFORM Risk Index, MapAction has been working to support national and subnational disaster managers to update or rebuild their disaster forecasts, mitigation tools, and risk atlases. During this period, it has worked on projects in Eswatini, Saint Kitts and Nevis, Niger, Lebanon and Madagascar.
Figure: MapAction volunteer Tom Hughes presenting the INFORM methodology during a simulation exercise
An important part of an INFORM Risk project is data collection on hazards, vulnerability and coping capacity. Because we are looking for subnational data (region, department, district, etc.), we sometimes find this data in a PDF text report instead of a spreadsheet or a geospatial file. Data often goes to die in PDFs because the format is so hard to access, so a basic tool that scans big reports in search of tables and/or key indicators would save a lot of time.
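As a rough illustration of what scanning a report for key indicators could mean in practice, here is a minimal sketch. It assumes the text of each page has already been pulled out of the PDF by some other library, and the keyword list and function names are made up for the example; they are not taken from the project's code.

```python
import unicodedata

# Hypothetical indicator keywords in several languages (illustrative only).
INDICATOR_KEYWORDS = {
    "poverty": ["poverty rate", "taux de pauvreté", "tasa de pobreza"],
    "flood": ["flood", "inondation", "inundación"],
}


def normalise(text):
    """Strip accents and case so 'Pauvreté' and 'pauvrete' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def find_indicator_pages(page_texts, keywords=INDICATOR_KEYWORDS):
    """Return {indicator: [1-based page numbers]} for pages mentioning a keyword."""
    hits = {}
    for page_number, text in enumerate(page_texts, start=1):
        clean = normalise(text)
        for indicator, terms in keywords.items():
            # Plain substring match: simple, but it will also match longer words.
            if any(normalise(term) in clean for term in terms):
                hits.setdefault(indicator, []).append(page_number)
    return hits
```

A plain substring match like this is deliberately crude. It only narrows down which pages of a long report are worth a closer look, and leaves the actual table extraction to a dedicated library.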
A new life for data trapped in PDFs
For this pilot project between MapAction, Analytics for a Better World (ABW) and Pipple, the target was to develop a tool that takes one PDF file as input and gives as output one or several spreadsheets with the data indicators per administrative division, plus any relevant metadata. The tool should be able to accommodate multiple languages and be written as a Python script.
This project was expected to have the following impacts:
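To make the "one PDF in, several spreadsheets plus metadata out" idea concrete, the output side of such a tool could look roughly like the sketch below. This is not the project's actual code: the function name, the dictionary keys and the file naming scheme are assumptions for illustration, and the tables are assumed to have been extracted already.

```python
import csv
import json
from pathlib import Path


def write_tables(tables, source_pdf, out_dir):
    """Write each extracted table to its own CSV, plus one JSON metadata file.

    `tables` is a list of dicts with hypothetical keys:
    'page' (int), 'header' (list of str) and 'rows' (list of lists).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(source_pdf).stem
    metadata = []
    for index, table in enumerate(tables, start=1):
        csv_path = out_dir / f"{stem}_table{index}.csv"
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(table["header"])
            writer.writerows(table["rows"])
        # Record where each spreadsheet came from so it can be traced back.
        metadata.append({
            "source_pdf": Path(source_pdf).name,
            "page": table["page"],
            "csv_file": csv_path.name,
            "n_rows": len(table["rows"]),
        })
    meta_path = out_dir / f"{stem}_metadata.json"
    meta_path.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return [out_dir / entry["csv_file"] for entry in metadata], meta_path
```

Keeping the source file name and page number next to every spreadsheet means each extracted value can be checked against the original report before it goes into a risk model.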
- Reduce the time needed to collect data from national or subnational reports,
- Enable the exploration of a larger set of reports and sources,
- Increase risk model completeness and enable national and regional disaster management agencies to make more informed decisions.
After the initial development phase by ABW and Pipple, MapAction now enters the utilisation phase, in which its team will adapt the scripts to its workflows and ongoing projects.
Because accessibility of knowledge betters the world, the Python scripts and user instructions are open source and available in ABW’s public GitHub repository.
How did we make this happen?
Upon first inspection of the sample PDF files, it became clear that we needed to explore different open-source Python libraries to see how they handled non-standard table structures within the documents.
Among the first packages that we explored were:
Both libraries worked well with standard tables, e.g., vertical tables with a single row header and well defined columns; however, when it came to more varied formats, they did not manage to identify the tables correctly.
In continuing our exploration, we tested one more library:
GMFT is a toolkit for converting PDF tables to many formats. It is lightweight, modular, and performant. While still under development, it already works very well and has outperformed the packages tried before. Thus, it was chosen as our final approach in the project. The package works out of the box; however, small alterations were required for better performance for our specific project.
What comes next?
The information extraction code developed during this pilot project will first be applied in MapAction’s support for the Southern African Development Community (SADC) regional subnational INFORM Risk Index. SADC is composed of 16 countries, home to over 360 million people and over 200 level-1 administrative divisions, which is the granularity of the model. Given the scale and level of detail of this model, information will be assembled from different national reports, and the tool developed during this project will be very useful for processing a large amount of data efficiently. The project will run from September 2024 to July 2025 as a collaboration between the SADC DRR unit, MapAction, GIZ and UNDP, building on the experience gathered in recent projects in Eswatini and Madagascar.
Figure: MapAction team members Daniel Soares and Anne-Marie Frankland, left, in blue t-shirts, together with representatives from UNDP, NDMA and other Eswatini agencies and ministries during the INFORM handover workshop in December 2023.
We hope this first pilot project will be the beginning of a long-term collaboration between MapAction and Analytics for a Better World, as both organisations share the same core values of improving lives and reducing suffering through data and technical expertise. This project speaks directly to ABW’s vision of connecting the private sector with the non-profit one, with Pipple’s Data Scientist, Sanne van den Bogaart, playing a key role in the development of the tool.
Better Together – MapAction & Analytics for a Better World
After the success of 2023’s edition, the ABW annual conference had its 2024 edition on May 14th. At this conference, an array of speakers and panelists representing ABW’s key stakeholders was gathered: nonprofits, researchers, and companies. Together, we reflected on the impact and progress of ABW, sharing achievements, and outlining future plans. Engaging discussions delved into pressing topics in analytics, including the challenges posed by AI.
Figure: Participants and panelists at the 2024 ABW Conference in Amsterdam (May 2024)
Alongside key agencies such as MSF, Save the Children, 510 and UNICEF, Analytics for a Better World kindly invited MapAction to participate in this year’s event. MapAction’s Head of Data Science, Daniel Soares, took part in the panel “Leveraging Geospatial Data and Big Data platforms for impactful decision-making”. During the conversation we discussed MapAction’s role in both emergency response and preparedness, with examples from our Anticipatory Action and Health programmes.
Analytics for a Better World’s NLP and GenAI work
ABW develops and implements AI and machine learning solutions for the public sector and non-profits. Other collaborations include overseeing the development of two LLM projects: a Retrieval-Augmented Generation (RAG) chatbot for automated information retrieval and improved response time, and LLMs for automatic text summarization.
Using NLP techniques, including embeddings and topic modelling, ABW is currently developing a system for multimedia topic analysis over time in the area of climate change. They collaborate with multidisciplinary teams to develop projects from the grant proposal and initial stages to implementation, delivery and communication with stakeholders across different levels.