How I Solved It: Extracting Business Insights from Unstructured Data With Generative AI

Turning data into actionable insights is one of the biggest barriers to ROI for most businesses. It’s time to move past it.

How I Solved It is a new content series in which expert engineers, data scientists, product managers, and AI experts from A.Team’s network of top tech talent reveal how they’re solving some of the biggest technological challenges today. In this edition, Richard Abrich — a top fractional CTO and AI expert in the A.Team Network — reveals a novel solution for one of the biggest challenges businesses face today: Extracting transformative business insights from a mess of unstructured data.

A Note on Terminology: Throughout this article, we use the term "unstructured data" to refer to both semi-structured data (like surveys with predefined questions but free-form answers) and completely unstructured data (like free-form text documents). While this case study focuses on semi-structured survey data, these approaches and techniques can be extended to handle fully unstructured data sources through additional preprocessing steps, which we'll explore later in the article.

Introduction

Enterprise companies are sitting on a goldmine of untapped business intelligence. My work with Fortune 500 companies reveals that most are unable to extract value from the 80% of their business data that remains unstructured [1], representing millions in unrealized revenue and missed opportunities.

Unlocking insights from your data is one of the biggest and clearest paths to ROI for most businesses, but to date, it’s proven incredibly tricky.

Consider one of my recent clients, a major media company. Despite investing heavily in customer research and collecting millions of survey responses, they struggled to answer crucial business questions like, "How many affluent women in the Northeast prefer red boots?" These insights could directly inform product strategy, marketing campaigns, and new revenue streams, but the manual analysis required to extract these insights was slow, expensive, and often incomplete, causing them to lose market opportunities to more agile competitors.

This challenge is universal across industries. Companies have vast troves of valuable information locked away in unstructured formats: customer emails, survey responses, social media interactions, support tickets, and more. Traditional analytics tools can't effectively process this unstructured data, leaving business leaders flying blind on critical decisions or relying on gut instinct where data could provide clarity.

As a fractional generative AI consultant, I’ve observed a clear pattern: companies are increasingly looking to generative AI not just for chatbots or content generation but as a key to unlocking this wealth of unstructured data. However, turning this aspiration into reliable, business-ready insights requires solving several critical challenges.

This article details how I built a system that helped my client use their data as a competitive advantage, enabling real-time access to customer insights that previously took weeks to generate.

While our implementation focused on survey data, the underlying approach can be generalized to unstructured data — text, images, audio, video, or digital workspace analytics. The key is systematically converting unstructured data into queryable attributes that unlock valuable business insights.

My hope is that it will help you unlock one of the biggest opportunities for driving true ROI from your generative AI initiatives. 

The Challenge

Turning unstructured data into structured, queryable information posed multiple technical challenges. The dataset's variability was immense, making it impractical to manually account for all possible responses or queries.

Two common approaches to handling unstructured data with AI are direct prompting of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). RAG enhances LLM responses by first retrieving relevant information from a knowledge base. However, both approaches have significant limitations in this context.

While LLMs can generate detailed responses to natural language queries, they’re prone to hallucination, producing plausible but factually incorrect outputs. When handling complex, aggregate queries like "How many women in the Northeast United States who earn over $100k prefer red boots?", direct prompting often leads to unreliable or outright incorrect results because the model cannot inherently perform computations or verify its claims against the source data.

RAG mitigates some of these issues by grounding LLM responses in retrieved context, but it still falls short when dealing with aggregate statistics. RAG frameworks excel at providing document-level or passage-level context; however, they lack mechanisms to perform operations like counting, filtering, or aggregating data across multiple records. For example, RAG can retrieve survey responses about color preferences, but it cannot compute how many respondents meet specific conditions (e.g., income, location, and preference criteria) without additional computational layers.

These limitations underscore the need for a more robust approach that leverages SQL's power for computing aggregate statistics across large datasets, a task that LLMs inherently struggle with. By combining SQL's precise numerical calculations and data aggregation capabilities with the adaptability of LLMs for interpreting unstructured data, we can create a system that complements natural language processing with traditional engineering techniques for reliable, consistent computations.

Our solution needed to meet several key objectives:

1) Provide accurate quantitative insights about unstructured data via natural language.

2) Ensure consistency and handle non-determinism.

3) Resolve errors automatically.

4) Enable full auditability, transparency, and debugging.

How I Solved It

One way to think about prompting large language models is that they operate with a fixed cognitive budget. The engineer's role is to direct this cognitive budget as effectively as possible toward solving a single, well-defined problem. This involves providing the model with all the necessary context while eliminating ambiguity, ensuring the highest likelihood of accurate results.

To address the challenges of extracting precise insights from unstructured data, we developed a system that deconstructs the problem into well-defined steps that maximize clarity and efficiency at each stage. It combines the adaptability of LLM-powered pipelines with the reliability of traditional data engineering principles, transforming unstructured data into queryable structures and using NL2SQL (Natural Language to SQL) so users can ask questions in plain English and get accurate, precise answers. Here's how it works:

1. Data Preparation: Before user interaction, the system ingests the available data and transforms it into a format that enables seamless processing and querying. This involves extracting raw data, defining structured attributes, and populating attribute values for downstream operations.

a. Data Extraction: The system begins by ingesting raw data from all available sources—in this case, a survey API containing respondents, questions, and responses. The initial data is stored in its original form in a relational database, preserving any relationships or structures in the source data. This ensures you don’t lose any context in downstream processing.

Figure 1: Data Extraction
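
To make this concrete, here is a minimal sketch of the raw storage schema. The production system used PostgreSQL and the client's survey API; SQLite and these table and column names are illustrative stand-ins so the snippet runs on its own:

```python
import sqlite3

# Raw survey data is preserved verbatim, keeping source relationships intact.
DDL = """
CREATE TABLE respondents (
    id        INTEGER PRIMARY KEY,
    source_id TEXT NOT NULL            -- respondent ID from the survey API
);
CREATE TABLE questions (
    id   INTEGER PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE TABLE responses (
    id            INTEGER PRIMARY KEY,
    respondent_id INTEGER NOT NULL REFERENCES respondents(id),
    question_id   INTEGER NOT NULL REFERENCES questions(id),
    raw_answer    TEXT NOT NULL        -- stored as-is; no information discarded
);
"""

conn = sqlite3.connect("survey.db")
conn.executescript(DDL)
conn.commit()
```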

b. Attribute Definition: After the initial data extraction, the system defines structured attributes for each question. These attributes are derived based on the content of questions and answers and the types of possible answers (e.g., boolean, string, float). Attributes are dynamically named using descriptive prefixes and suffixes to ensure clarity and to maintain context for downstream queries.

The extracted attributes are stored in an attributes table, where each attribute is linked to its corresponding question. The naming convention ensures compatibility with automated pipelines and natural language queries. For instance, a multiple-choice question about color preferences might generate attributes such as preferences__likes_red (boolean) or preferences__primary_color (string). This structure lays the groundwork for mapping responses to respondent-level attribute values.

Figure 2: Attribute Definition
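
For illustration, attribute definitions can be modeled as small typed records that the LLM is asked to produce for each question; the class and field names here are assumptions, not the production code:

```python
from enum import Enum

from pydantic import BaseModel

class AttributeType(str, Enum):
    BOOLEAN = "boolean"
    STRING = "string"
    FLOAT = "float"

class AttributeDefinition(BaseModel):
    name: str            # e.g. "preferences__likes_red"
    type: AttributeType
    question_id: int     # links the attribute to its source question

# What the LLM might derive from a multiple-choice question about colors:
examples = [
    AttributeDefinition(name="preferences__likes_red",
                        type=AttributeType.BOOLEAN, question_id=7),
    AttributeDefinition(name="preferences__primary_color",
                        type=AttributeType.STRING, question_id=7),
]
```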

c. Attribute Value Extraction: Once the attributes are defined, respondent answers are processed to populate attribute values, which are stored in an Entity-Attribute-Value (EAV) [2] model. Each answer is mapped to a specific attribute and respondent, creating a highly flexible and extensible structure.

For example, a respondent's answer to a multiple-choice question about favorite colors would result in a boolean attribute value (preferences__likes_red = True) or a string attribute value (preferences__primary_color = "blue"). These values are linked to the original answer, ensuring traceability and consistency.

Figure 3: Attribute Value Extraction
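
As a sketch, the EAV table needs only a handful of columns, with each value row keeping a pointer back to the raw answer for traceability. This continues the illustrative SQLite example from earlier; names are assumptions:

```python
import sqlite3

conn = sqlite3.connect("survey.db")
# Entity-Attribute-Value storage: one row per extracted attribute value.
conn.executescript("""
CREATE TABLE IF NOT EXISTS attribute_values (
    id            INTEGER PRIMARY KEY,
    respondent_id INTEGER NOT NULL,   -- entity
    attribute_id  INTEGER NOT NULL,   -- attribute
    answer_id     INTEGER NOT NULL,   -- provenance: the raw answer it came from
    value         TEXT NOT NULL       -- value, cast to its declared type on read
);
""")
conn.executemany(
    "INSERT INTO attribute_values (respondent_id, attribute_id, answer_id, value)"
    " VALUES (?, ?, ?, ?)",
    [
        (42, 1, 101, "true"),   # preferences__likes_red = True
        (42, 2, 101, "blue"),   # preferences__primary_color = "blue"
    ],
)
conn.commit()
```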

2. Query Processing: Once the data is prepared, the system processes user-submitted natural language queries to extract actionable insights. This process involves dynamically generating SQL views, executing tailored SQL statements, and translating the results into human-readable outputs.

a. Relevant Attribute Identification

The system analyzes the natural language query to determine the most relevant attribute prefixes. Prompt-driven logic matches the query's intent with the available attributes, focusing on the ones most pertinent to the question.

Figure 4: Relevant Attribute Identification
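
As an illustration, this matching step can itself be a constrained LLM call that may return only prefixes drawn from the catalog. The model name, prompt wording, and function below are assumptions, not the production code:

```python
from openai import OpenAI
from pydantic import BaseModel

class RelevantPrefixes(BaseModel):
    prefixes: list[str]  # e.g. ["preferences", "demographics"]

client = OpenAI()

def identify_relevant_prefixes(query: str, all_prefixes: list[str]) -> list[str]:
    # Structured Outputs constrains the reply to the RelevantPrefixes schema.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Select only the attribute prefixes needed to answer "
                        "the user's question."},
            {"role": "user",
             "content": f"Question: {query}\nAvailable prefixes: {all_prefixes}"},
        ],
        response_format=RelevantPrefixes,
    )
    return completion.choices[0].message.parsed.prefixes
```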

b. Dynamic View Creation

The attribute values table is stored in an entity-attribute-value (EAV) format, which poses a challenge for generating useful SQL queries. In this format, the schema consists of only three columns (entity, attribute, and value), which gives the model no direct insight into the content of the data. To overcome this, the system generates a relevant attribute values table, where columns correspond to the relevant attributes identified from the query. This transformation allows us to embed a semantically meaningful schema into the prompt, providing the model with sufficient context to construct accurate and useful SQL queries.

Given the many attributes in the attribute values table, including all attributes in the dynamic view is impractical, as this would exceed the model’s context length and dilute the signal-to-noise ratio. By limiting the dynamic view to only the attributes relevant to the query, the system ensures that the model receives a focused and manageable schema, maximizing its ability to generate precise SQL queries.

To achieve this, the system generates a view name using a combination of timestamps and a hash of the relevant attribute names, ensuring unique and consistent naming. Column aliases are dynamically created to handle constraints like PostgreSQL's 63-character limit. The truncated column name used in the view is paired with the full column name in the prompt. This approach gives the model the necessary context while adhering to database constraints. These views are temporary and are dropped once the query execution is complete, minimizing database overhead while maintaining scalability and efficiency.

Figure 5: Dynamic View Creation
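
Here is a sketch of how such a view could be built by pivoting the EAV table, including the timestamp-plus-hash naming and alias truncation described above. It assumes a psycopg 3 connection to PostgreSQL and the illustrative table names from earlier:

```python
import hashlib
import time

def create_dynamic_view(conn, relevant_attributes: dict[str, str]) -> str:
    """Pivot the EAV table into a wide, typed view over only the relevant
    attributes. `relevant_attributes` maps attribute name -> SQL type."""
    digest = hashlib.sha1(
        "".join(sorted(relevant_attributes)).encode()
    ).hexdigest()[:8]
    view_name = f"attrs_{int(time.time())}_{digest}"
    columns = ",\n      ".join(
        # Alias truncated to PostgreSQL's 63-character identifier limit; the
        # prompt pairs each truncated alias with the full attribute name.
        f"MAX(CASE WHEN a.name = '{name}' THEN av.value END)::{sql_type}"
        f" AS {name[:63]}"
        for name, sql_type in relevant_attributes.items()
    )
    conn.execute(f"""
        CREATE TEMPORARY VIEW {view_name} AS
        SELECT av.respondent_id,
      {columns}
        FROM attribute_values av
        JOIN attributes a ON a.id = av.attribute_id
        GROUP BY av.respondent_id
    """)
    return view_name  # dropped explicitly once query execution completes
```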

c. Natural Language to SQL (NL2SQL)

Using the schema of the typed dynamic view, the system generates SQL statements that align with the query's intent. A template-driven approach ensures compatibility with the dynamic view and supports various query complexities.

Figure 6: Natural Language to SQL (NL2SQL)
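
A hedged sketch of the generation step; the prompt wording and model choice are ours, not the production values:

```python
from openai import OpenAI
from pydantic import BaseModel

class GeneratedSQL(BaseModel):
    sql: str

NL2SQL_PROMPT = """\
You are an expert data analyst. Write one PostgreSQL SELECT statement that
answers the question below, using ONLY this view.

View: {view_name}
Columns (truncated alias -> full attribute name):
{column_map}

Question: {question}
"""

client = OpenAI()

def nl2sql(question: str, view_name: str, column_map: str) -> str:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # illustrative model choice
        messages=[{"role": "user", "content": NL2SQL_PROMPT.format(
            view_name=view_name, column_map=column_map, question=question)}],
        response_format=GeneratedSQL,
    )
    return completion.choices[0].message.parsed.sql
```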

d. SQL Execution

The generated SQL is executed against the database containing the dynamic view.

Figure 7: SQL Execution

e. Reply Generation

Once the SQL successfully executes and retrieves data, the system summarizes the results into a reply to the original natural language query in human-readable format. This final output is generated using prompt-driven logic that translates raw database results into concise, query-relevant insights.

Figure 8: Reply Generation
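
The summarization prompt can stay simple; a minimal sketch (wording is illustrative, not the production prompt):

```python
# Illustrative prompt template for turning SQL results into a reply.
REPLY_PROMPT = """\
Using only the SQL results below, answer the user's question in one or
two sentences. If the results are empty, say so; do not speculate.

Question: {question}
Executed SQL: {sql}
Results: {rows}
"""
```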

3. Addressing Non-Determinism: The system addresses the challenge of unpredictable LLM outputs and the evolving nature of attribute sets by leveraging dynamically generated Pydantic models and OpenAI's constrained decoding via Structured Outputs [3]. This ensures that outputs adhere strictly to a predefined schema, eliminating ambiguity and errors caused by inconsistent or invalid responses.

Constrained decoding works by tightly defining the structure of expected outputs through JSON schemas, which can be generated from Pydantic models. The LLM is prompted to produce responses that strictly match the given schema, such as specific fields with defined data types (e.g., string, boolean, float). This provides a reliable guardrail, ensuring that model outputs are valid and predictable.

To implement this approach, the system dynamically creates Pydantic models tailored to the specific attributes and queries being processed:

1. Dynamic Model Generation: Using pydantic.create_model [4], the system generates schema-specific models on the fly based on the attribute set relevant to a query. Each model validates the output format against the expected types of attributes. For instance:

attribute_name: Constrained to a dynamically generated enumeration of valid attribute names.

attribute_value: Validated against the corresponding attribute's expected data type (str, float, bool, etc.).

2. Structured Outputs: When querying the LLM, the system includes the dynamically generated schema as part of the input prompt, instructing the model to generate responses in a strict format. This leverages OpenAI's constrained decoding, ensuring that the output precisely matches the schema with “100% reliability”:

While sampling, after every token, our inference engine will determine which tokens are valid to be produced next based on the previously generated tokens and the grammar rules indicating which tokens are valid next. We then use this list of tokens to mask the next sampling step, effectively lowering the probability of invalid tokens to 0.

Michelle Pokrass, OpenAI [5]

For example, if a query requires filtering based on color preferences, a dynamically generated schema might look like this:
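
A sketch, assuming Pydantic v2 and the two color-preference attributes from earlier (in the real pipeline, the attribute list comes from the attributes table):

```python
from enum import Enum
from typing import Union

from pydantic import create_model

# Attribute names relevant to this query; illustrative values.
attribute_names = ["preferences__likes_red", "preferences__primary_color"]

# Constrain attribute_name to exactly the valid names via a dynamic enum.
AttributeName = Enum("AttributeName", {n: n for n in attribute_names}, type=str)

# Dynamically build the response model: attribute_value must match one of
# the expected data types for these attributes.
AttributeValueModel = create_model(
    "AttributeValueModel",
    attribute_name=(AttributeName, ...),
    attribute_value=(Union[bool, str], ...),
)

# The JSON schema passed to the LLM for constrained decoding:
print(AttributeValueModel.model_json_schema())
```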

By constraining the decoding process to this schema, the LLM produces only valid outputs like:
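
Continuing the illustrative example, the raw model response parses cleanly against that schema:

```python
# Constrained decoding guarantees the response matches the schema: the model
# can emit {"attribute_name": "preferences__likes_red",
# "attribute_value": true}, but never an attribute name outside the enum.
parsed = AttributeValueModel.model_validate_json(
    '{"attribute_name": "preferences__likes_red", "attribute_value": true}'
)
print(parsed.attribute_name.value, parsed.attribute_value)
```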

This approach combines the adaptability of language models with the rigor of traditional type-checking, ensuring that outputs are context-specific and schema-compliant. As a result, the system achieves high reliability and precision, even when handling diverse and evolving datasets.

4. Exception Handling: The system incorporates robust exception-handling mechanisms to ensure reliability and minimize disruptions. If an error occurs—such as a malformed SQL query, incomplete LLM response, or invalid attribute mapping—the system captures the context of the error and reintegrates it into the pipeline. This iterative approach enables dynamic adjustments, allowing the system to refine its output without requiring manual intervention.
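
In code, the retry loop might look like the following sketch, where generate_sql and execute stand in for the pipeline's NL2SQL and database layers (both names are hypothetical):

```python
def run_with_retry(generate_sql, execute, question: str, max_attempts: int = 3):
    """On failure, fold the error context back into the next generation
    attempt so the model can correct itself without manual intervention."""
    error_context = ""
    for _ in range(max_attempts):
        sql = generate_sql(question, error_context=error_context)
        try:
            return execute(sql)
        except Exception as exc:
            # Capture what failed and why; reintegrate it into the prompt.
            error_context = f"Previous SQL:\n{sql}\nError: {exc}"
    raise RuntimeError(f"Query failed after {max_attempts} attempts. {error_context}")
```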

5. Auditability and Traceability: Every AI-generated operation in the data pipeline is meticulously logged into the database to provide full auditability and traceability. From initial attribute extraction to query resolution, the system records all interactions, including prompt inputs, LLM outputs, and retry attempts. This comprehensive logging enables clear visibility into the decision-making process at every stage, simplifying debugging and ensuring accountability.

The final implementation delivered a scalable, flexible pipeline that could adapt to different input formats and reliably generate structured, queryable data from unstructured sources. This approach allowed the client to extract precise insights at scale, unlocking new revenue opportunities and improving their ability to make data-driven decisions.

Figure 9: System Overview (Semi-Structured Data)

Extending to Unstructured Data

Our implementation focused on semi-structured survey data, where respondents naturally served as entities. However, the EAV architecture we developed is equally powerful for handling completely unstructured data sources like customer emails, social media posts, or support tickets.

The critical additional requirement for unstructured data is accurate entity identification. While survey data presents respondents as clear entities, unstructured sources require that we first identify relevant entities—like customers, products, or business interactions. The accuracy of downstream analysis depends heavily on precise entity extraction, making this step crucial for reliable results.

To implement accurate entity identification, the system can leverage:

A) Unique identifiers (customer names, email addresses, phone numbers, account IDs, physical addresses)

B) Metadata (timestamps, filenames, source systems, document types, authors)

C) Cross-referencing with existing customer databases or CRM systems

Once entities are accurately identified and validated, the rest of our pipeline remains unchanged. The same attribute and value extraction processes we developed for survey responses work seamlessly with unstructured content, transforming it into queryable business intelligence.
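
As a sketch of the entity-identification step, an LLM with structured outputs can pull candidate identifiers from a free-form document before cross-referencing them against a CRM. The field names, prompt, and model below are illustrative assumptions:

```python
from openai import OpenAI
from pydantic import BaseModel

class ExtractedEntity(BaseModel):
    # Nullable so the model can decline rather than guess an identifier.
    customer_name: str | None
    customer_email: str | None
    account_id: str | None

client = OpenAI()

def identify_entity(document: str) -> ExtractedEntity:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Extract identifiers for the customer this document "
                        "concerns. Use null for anything not present."},
            {"role": "user", "content": document},
        ],
        response_format=ExtractedEntity,
    )
    # The result is then validated against the CRM before attribute
    # extraction proceeds exactly as in the survey pipeline.
    return completion.choices[0].message.parsed
```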

Figure 10: System Overview (Unstructured data)

Future Work

While the current system effectively transforms semi-structured data into actionable insights, several promising directions for future enhancements could further improve performance and generalization.

Query Reuse through RAG

Rather than directly executing new queries, we can implement a RAG system to retrieve and adapt previously successful queries. By maintaining a knowledge base of past query-result pairs, the system can retrieve similar historical queries when faced with a new question. This approach would:

A) Match incoming natural language queries with semantically similar past queries

B) Retrieve the corresponding SQL and execution patterns that worked successfully

C) Adapt the retrieved queries to the current context if needed

D) Fall back to generating new queries only when no suitable matches are found

This pattern would reduce computational overhead by avoiding redundant query generation while maintaining consistency in how similar questions are answered over time. The RAG system would continuously learn from successful queries, building an increasingly valuable repository of proven query patterns that can be reused and adapted.
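
One possible shape for the retrieval step, using OpenAI embeddings and cosine similarity over previously successful question-SQL pairs; the data layout and threshold are illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def find_similar_query(question: str, history: list[dict], threshold: float = 0.9):
    """history items look like {"question": str, "embedding": np.ndarray,
    "sql": str}. Returns the best past match, or None to signal that a
    fresh NL2SQL generation is needed."""
    q = embed(question)
    best, best_score = None, threshold
    for item in history:
        e = item["embedding"]
        score = float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e)))
        if score >= best_score:
            best, best_score = item, score
    return best
```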

Fine-Tuning Through Usage Patterns and Feedback

The system could be enhanced by implementing a learning loop that combines successful query patterns with user response feedback. By tracking which natural language queries produce valuable insights (based on user ratings and engagement) and logging the corresponding prompt-SQL-response chains, we can fine-tune the LLM to better understand what makes an effective interaction. This would improve the system's ability to handle common questions and generate more relevant responses while adapting to each organization's specific needs over time.

Adaptive Query Planning: Structured Workflows or Autonomous Agents

Implementing a query planning mechanism could improve reliability and accuracy. Complex queries can be broken down into smaller, manageable steps, where each step is evaluated independently. This stepwise approach would reduce the likelihood of errors, ensure greater transparency in query processing, and provide intermediate results that offer users insight into the system's operations. A modular framework like this would enhance both adaptability and the ability to handle more sophisticated queries.

This approach can be implemented as either a workflow or an agent, as defined by Anthropic:

Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Conversely, agents are systems where LLMs dynamically direct their processes and tool usage, maintaining control over how they accomplish tasks.

–Erik Schluntz and Barry Zhang, Anthropic [6]

The current system operates as a workflow, with query planning and stepwise evaluation following a predetermined sequence of operations. The LLM assists at each step but adheres to a script defined by the developers. This approach offers predictability and consistency, which is crucial for well-defined tasks and regulatory compliance. It has proven effective in handling the complex queries encountered so far, allowing for reliable and auditable query processing.

Extending this to an agent-based system would grant the LLM more autonomy in the query planning process. An agent could dynamically decide how to break down complex queries, when to use subqueries or temporary tables, and how to optimize the query execution plan based on its analysis of the data and intermediate results. This approach could potentially handle a broader range of query complexities and adapt more flexibly to different data scenarios. However, it also introduces challenges such as reduced predictability, increased complexity in oversight and validation, higher computational costs, and the potential for suboptimal decisions.

Expanding Data Sources and Types

While our current system focuses on processing survey data, future iterations could significantly expand the range of data sources and types it can handle, including multimodal data. This expansion would enhance the system's versatility and provide more comprehensive insights. Some potential additions include:

1) Text-based Content: Processing diverse textual data sources — including social media feeds, customer support interactions, and internal documents — to analyze sentiment, trends, customer feedback, and organizational insights. This encompasses data from platforms like Twitter and LinkedIn, as well as chat logs, emails, call transcripts, company reports, and memos.

2) Multimedia Content: Integrating image, audio, and video processing capabilities to extract insights from visual and auditory data. This could enable correlating textual survey responses with product images, analyzing sentiment in written and recorded customer calls, or combining video metadata with viewer comments.

3) Digital Workspace Analytics: Leveraging tools like OpenAdapt [7] to capture and understand how employees actually work with their software applications in real-world scenarios. This provides unprecedented visibility into daily digital workflows, revealing opportunities for efficiency improvements, training needs, and automation possibilities that traditional analysis methods might miss. By understanding how people interact with their digital tools, organizations can make data-driven decisions to optimize productivity and reduce friction points in daily operations.

By expanding the types of data sources, the system could offer a more holistic view of business operations, customer behavior, and market trends. This would enable organizations to make more informed decisions based on a broader spectrum of information.

Conclusion

Extracting actionable insights from your data is arguably the biggest roadblock to ROI for most businesses. It’s time for us to get over that hurdle.

Combining prompt-driven pipelines with traditional data engineering practices has transformative potential. By leveraging the adaptability of LLMs and the reliability of structured data processing, we successfully addressed the challenges of non-determinism, exceptions, and unstructured data variability, unlocking precise insights at scale and enabling measurable business outcomes.

This hybrid approach is a blueprint for organizations grappling with large, complex datasets. As language models evolve, integrating them with structured pipelines will unlock even greater potential for automation, efficiency, and decision-making. For technical leaders, the lesson is clear: evaluate your organization’s data challenges today and invest in systems that marry traditional engineering rigor with cutting-edge AI. Before long, the path to driving real business value will be much clearer.

Richard Abrich is a top fractional CTO and AI Expert in the A.Team Network, and the founder of OpenAdapt.AI. Follow him on LinkedIn.

Need help solving a problem like this with an AI expert like Richard? Request a consultation here.

References

[1] https://www.ibm.com/think/topics/structured-vs-unstructured-data

[2] https://en.wikipedia.org/wiki/Entity-attribute-value_model

[3] https://platform.openai.com/docs/guides/structured-outputs

[4] https://docs.pydantic.dev/latest/api/base_model/#pydantic.create_model

[5] https://openai.com/index/introducing-structured-outputs-in-the-api/

[6] https://www.anthropic.com/research/building-effective-agents

[7] https://github.com/OpenAdaptAI/OpenAdapt
