Data Engineering in the Age of AI: New Horizons in Data Discovery

In the age of big data, more data is supposed to mean better, data-driven decisions. With vast amounts of information at our fingertips, government agencies and businesses alike should be empowered to make smarter, faster, and more informed choices. But sometimes, the abundance of data isn’t just empowering; it’s overwhelming.
When that happens, finding the right data among endless repositories can feel like chasing shadows. For data engineers and analysts, the challenge is not just about having enough data, but about finding the right data quickly and efficiently. To tackle this issue, GovTech’s Data Engineering Practice team is using artificial intelligence (AI) to rethink the way we approach data discovery.
Playing hide and seek with data
Playing hide and seek at the void deck as kids is all fun and games. But when data plays hide and seek, the fun quickly disappears.
For data professionals, “seeking” often means sifting through countless databases and metadata, trying to locate the exact piece of information needed. This process is not only labour-intensive and time-consuming but also inherently complex. It relies heavily on keyword searches, where users must guess the right terms or pre-defined tags. If you don’t know the exact keyword, good luck finding that elusive dataset. This often leads to information overload, where users are swamped by too much metadata and struggle to identify what’s truly relevant.
For agencies like the Ministry of Manpower (MOM), which handles between 900 and 1,000 external data requests from other government agencies each year, this poses a real challenge. Scaling data discovery to meet growing demand requires innovation that can break through current bottlenecks.
AI for data engineering
Recognising these challenges, the Data Engineering Practice (DP) team introduced a powerful ally – artificial intelligence.
By embracing AI-driven data engineering, the team set out to automate and optimise key aspects of the data pipeline, particularly in data discovery. The goal was simple yet ambitious: make finding the right data as intuitive as possible, while reducing the heavy lifting traditionally required from human users.

The DP team focused on two core innovations to enhance data discovery:
- Embedding search: This technique goes beyond simple keyword searches and uses natural language processing to help users find relevant data based on context and meaning instead of exact matches. It means that users no longer need to guess the right keywords. Instead, they can use natural language queries, and the system will understand the intent, surfacing the most relevant data elements.
- Graph search: Datasets don't exist in isolation; they are interconnected. Graph search allows users to explore and visualise the relationships between different data elements. It’s like drawing a map of your data to show how one dataset links to another, revealing deeper insights and connections that might have otherwise been missed.
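To make the idea of embedding search concrete, here is a minimal sketch in Python. It is not the DP team's implementation: the catalogue entries, the `CONCEPT_AXES` table and the `embed` function are all hypothetical, and the hand-written word-to-concept mapping is only a toy stand-in for a trained embedding model. What it does show is the core mechanism: queries and metadata descriptions are turned into vectors, and results are ranked by cosine similarity rather than by shared keywords.

```python
import math

# Toy stand-in for a learned embedding model (assumption): each word maps to a
# concept axis, so synonyms land on the same dimension of the vector.
CONCEPT_AXES = {
    "salary": 0, "wage": 0, "wages": 0, "pay": 0, "income": 0,
    "employee": 1, "employees": 1, "worker": 1, "workers": 1, "staff": 1,
    "permit": 2, "permits": 2, "pass": 2, "passes": 2,
    "injury": 3, "injuries": 3, "accident": 3, "accidents": 3,
}
DIMENSIONS = 4

# Hypothetical metadata catalogue: field name -> description.
CATALOGUE = {
    "monthly_wage": "Gross monthly salary of each employee",
    "work_pass_type": "Category of permit held by the worker",
    "incident_count": "Number of workplace accidents reported",
}


def embed(text):
    """Turn text into a concept-count vector."""
    vector = [0.0] * DIMENSIONS
    for word in text.lower().split():
        axis = CONCEPT_AXES.get(word.strip(".,?"))
        if axis is not None:
            vector[axis] += 1.0
    return vector


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, top_k=2):
    """Rank catalogue fields by similarity to the query, best match first."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(desc)), name)
              for name, desc in CATALOGUE.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]


print(search("How much income do staff earn?"))
# prints ['monthly_wage', 'work_pass_type']
```

Note that the query shares no words with the description of `monthly_wage`, yet that field ranks first because "income" and "staff" fall on the same concept axes as "salary" and "employee".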
Besides enabling smarter searches, AI also allowed the DP team to offload repetitive, resource-heavy tasks to machines. These tireless algorithms could iterate through vast amounts of metadata, continually enriching it and refining search results. Over time, the system gets better at surfacing the most relevant data, moving towards a globally optimal solution for data discovery.
With machines handling the grunt work, human users can now focus on higher-level tasks: interpreting results, validating insights, and applying their expertise to solve complex problems.
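The graph search idea described above can be sketched in a similar way. The example below is an illustration under stated assumptions, not the team's actual design: the dataset names and linking fields in `EDGES` are invented, and a plain breadth-first search stands in for whatever graph engine a real system would use. It treats datasets as nodes and shared fields as edges, then finds the shortest chain of datasets connecting two of them.

```python
from collections import deque

# Hypothetical metadata graph: each tuple says two datasets share a linking field.
EDGES = [
    ("employers", "work_passes", "employer_id"),
    ("work_passes", "workers", "worker_id"),
    ("workers", "injury_reports", "worker_id"),
    ("employers", "levy_payments", "employer_id"),
]


def build_graph(edges):
    """Build an undirected adjacency list: dataset -> [(neighbour, field)]."""
    graph = {}
    for left, right, field in edges:
        graph.setdefault(left, []).append((right, field))
        graph.setdefault(right, []).append((left, field))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search for the shortest chain of datasets from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour, _field in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the two datasets are not connected


graph = build_graph(EDGES)
print(find_path(graph, "employers", "injury_reports"))
# prints ['employers', 'work_passes', 'workers', 'injury_reports']
```

In this toy graph, nothing links employers to injury reports directly, but the search reveals a route through work passes and workers, which is the kind of connection a keyword search alone would miss.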
Putting it all together
To make smart data discovery a reality, the Data Engineering Practice team worked over three months to develop a prototype for MOM.
Rather than sticking to existing methods like the single-retrieval vector-based Retrieval-Augmented Generation (RAG) approach, the team pushed boundaries. They introduced a more efficient, flexible, and comprehensive system known as the Multi-Retrieval Agentic Graph-based RAG Approach.
Some of its unique features include:
- Metadata Knowledge Graph: To uncover deeper relationships between data elements, the team built a comprehensive knowledge graph. Users can discover, organise, and interact with metadata more intuitively, making complex data ecosystems easier to navigate.
- Automated metadata enrichment: Preparing detailed metadata descriptions can be a tedious task. By using AI to automate this process, metadata is continually updated without requiring hours of manual work.
- Agentic retrieval: The team also introduced autonomous agents that can make independent decisions and work together with humans to solve problems.
- Natural language queries: This feature enables users to simply type in questions or requests in plain English, significantly boosting user-friendliness and lowering the barrier to entry for non-technical users.
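As a rough illustration of how multi-retrieval and agentic behaviour might fit together, consider the sketch below. Everything in it is an assumption made for illustration: the catalogue, the link table, and the function names are hypothetical, and a simple rule-based `agent` stands in for the AI agents the prototype uses. The point is the shape of the flow: the agent starts with one retriever, then decides for itself whether a second, graph-based retrieval step is worth running.

```python
# Hypothetical catalogue and link table (not MOM's actual metadata).
DESCRIPTIONS = {
    "work_passes": "permits held by foreign workers",
    "employers": "registered companies and their industry",
    "injury_reports": "workplace accidents reported by employers",
}
LINKS = {
    "work_passes": ["employers"],
    "employers": ["work_passes", "injury_reports"],
    "injury_reports": ["employers"],
}
RELATION_WORDS = ("related", "linked", "connected")


def text_retriever(query):
    """First retriever: datasets whose description shares a word with the query."""
    words = set(query.lower().split())
    return [name for name, desc in DESCRIPTIONS.items()
            if words & set(desc.split())]


def graph_retriever(seeds):
    """Second retriever: datasets one hop away from the seed datasets."""
    related = []
    for seed in seeds:
        for neighbour in LINKS.get(seed, []):
            if neighbour not in seeds and neighbour not in related:
                related.append(neighbour)
    return related


def agent(query):
    """Rule-based stand-in for an agent choosing which retrievers to call."""
    steps = ["text"]
    hits = text_retriever(query)
    # Decision point: expand through the graph only when the user asks
    # about relationships between datasets.
    asks_for_links = any(w in query.lower().split() for w in RELATION_WORDS)
    if hits and asks_for_links:
        steps.append("graph")
        hits = hits + graph_retriever(hits)
    return {"steps": steps, "datasets": hits}


print(agent("datasets related to permits"))
# prints {'steps': ['text', 'graph'], 'datasets': ['work_passes', 'employers']}
```

A plain query such as "permits" stops after the text step, while a query about related datasets triggers the extra graph step. In a real agentic system, that decision would be made by a language model rather than a keyword rule.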
No more hide and seek
MOM officers now have a powerful new assistant in their data discovery toolkit. What was once a complex, time-consuming process has become faster, smarter, and more intuitive.
The prototype not only streamlines data discovery but also empowers MOM to manage, utilise, and govern data more effectively. With improved discoverability and understanding of high-quality data, officers can make better-informed decisions that drive meaningful outcomes for Singaporeans, foreign employees and businesses.
More innovations to come
As big data gets ever bigger, the game of data hide and seek will only grow more complex. The Data Engineering Practice team will continue to push the boundaries of what’s possible with AI, giving seekers the tools they need to win, and win faster.
Stay tuned for our next tech news article, where we’ll dive deeper into the five components that power this cutting-edge data discovery prototype.