By Jane Shen on October 02, 2023

Evolution of Question Answering Technology: A Six-Decade Journey

Artificial intelligence (AI) has been growing rapidly in recent years, revolutionising various digital services. One area where AI has made a significant impact is in customer service. Conversational agents like chatbots and digital employees have allowed businesses to interact with customers in a more personalised and efficient way.

But what exactly powers these conversational agents? The answer lies in an area called question answering (QA), whose history can be traced back 60 years. Over those six decades, QA systems have advanced remarkably, evolving from early rule-based models to the sophisticated generative AI-based approaches we see today. Let's dive in and look at the progress that has been made in this ever-evolving field.

The Early Rule-Based QA Systems

In the 1960s and 1970s, QA systems were built on structured databases and hand-written rules. These databases contained lists and graphs of specified entities, and a query could only retrieve an answer if it was formatted in a certain way or followed a specific template. For example, the BASEBALL program extracted answers from a structured database, but the questions had to be highly constrained for the system to work effectively. Another well-known example is ELIZA, a chatbot that turned user statements into questions. Although restrictive, these systems laid the foundation for the sophisticated technology we have today.
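To make the idea of a constrained question template concrete, here is a minimal sketch in Python. It is not a reconstruction of any historical system: the table contents, team names, and the single question pattern are all invented for illustration.

```python
import re

# Hypothetical structured "database": (team, year) -> games won.
# The teams and numbers are made up purely for illustration.
GAMES_WON = {
    ("lions", 1960): 12,
    ("tigers", 1960): 9,
}

# The only question form this toy system understands (an assumption).
TEMPLATE = re.compile(
    r"^how many games did the (?P<team>[a-z ]+) win in (?P<year>\d{4})\?$"
)

def answer(question: str) -> str:
    """Answer a question only if it fits the fixed template."""
    match = TEMPLATE.match(question.strip().lower())
    if match is None:
        return "Sorry, I cannot parse that question."
    key = (match.group("team"), int(match.group("year")))
    if key not in GAMES_WON:
        return "No record found."
    return str(GAMES_WON[key])

print(answer("How many games did the Lions win in 1960?"))
print(answer("Did the Lions have a good season in 1960?"))
```

The second question is about the same data but does not fit the template, so the system cannot answer it. That is the kind of brittleness rule-based systems suffered from.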

The 1980s and 1990s marked a pivotal time for linguistic techniques and lexical resources in QA systems. TELI, for instance, was designed to handle comparatives by transforming natural language sentences into executable expressions, which allowed more complex linguistic structures to be processed. Another system, PARIS, used a large knowledge base structured around the lexical database WordNet to extract answers, with inference rules that made information retrieval more efficient. These advances enhanced the capabilities of QA systems and made human-machine interaction more natural.

Even so, these early QA systems were limited by their structured databases and constrained question formats. They could only provide answers within a narrow scope of information, and the types of questions they could handle were strictly bounded by their query transformation rules.

Feature Engineering-Based QA Systems

Since the 2000s, machine learning (ML) algorithms have propelled the question answering field to new heights. ML-based QA systems, also known as feature engineering-based systems, infer answers from a set of features extracted from relevant natural language documents. These systems consist of three main components: question processing, document processing, and answer processing.

The question processing component classifies questions into different categories according to answer types. The document processing component retrieves relevant documents and filters them based on defined features. The answer processing component identifies answer candidates and validates them. These ML-based QA systems combine natural language processing (NLP) technology with AI techniques, resulting in more sophisticated architectures and capabilities. IBM Watson, built on the DeepQA architecture, showcased outstanding performance on the TV quiz show Jeopardy! in 2011.
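The three components can be sketched as a small pipeline. This is a simplified illustration, not how Watson or any production system works: the hand-written rules stand in for trained classifiers and engineered features, and all function names are hypothetical.

```python
import re
from typing import List, Optional

def classify_question(question: str) -> str:
    """Question processing: map the question to an expected answer type."""
    words = question.lower().split()
    first = words[0] if words else ""
    return {"when": "DATE", "who": "PERSON", "where": "LOCATION"}.get(first, "OTHER")

def words_of(text: str) -> set:
    """Lower-cased set of word tokens, used as a crude feature."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve_documents(question: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Document processing: rank documents by word overlap with the question."""
    q_words = words_of(question)
    ranked = sorted(documents, key=lambda d: len(q_words & words_of(d)), reverse=True)
    return ranked[:top_k]

def extract_answer(answer_type: str, passages: List[str]) -> Optional[str]:
    """Answer processing: return the first candidate matching the answer type."""
    patterns = {"DATE": r"\b\d{4}\b"}  # only four-digit years are supported here
    pattern = patterns.get(answer_type)
    if pattern is None:
        return None
    for passage in passages:
        found = re.search(pattern, passage)
        if found:
            return found.group(0)
    return None

def answer_question(question: str, documents: List[str]) -> Optional[str]:
    """Run the three stages in order."""
    answer_type = classify_question(question)
    passages = retrieve_documents(question, documents)
    return extract_answer(answer_type, passages)

docs = [
    "The Transformer architecture was developed in 2017.",
    "ChatGPT emerged in 2022.",
]
print(answer_question("When was the Transformer architecture developed?", docs))  # 2017
```

In a real feature engineering-based system, each stage would be driven by features such as part-of-speech tags, named entities, and term weights rather than a single regular expression.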

Recurrent Neural Network-Based QA Systems

In the ever-evolving landscape of AI technology, artificial neural networks (ANNs) have emerged as a game-changer in problem-solving. Unlike traditional methods that rely on manual feature extraction, ANNs have the remarkable ability to automatically learn meaningful representations from input and output data. This revolutionary approach has paved the way for neural network models to be trained in an end-to-end fashion, making them incredibly powerful tools for a wide range of natural language processing (NLP) tasks.

Among the different types of ANNs, recurrent neural networks (RNNs) stand out as a powerful variant. RNNs, especially in the form of Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, excel at capturing sequential relationships within data, making them a natural choice for NLP tasks like question answering.

In the mid-2010s, RNN-based QA systems started to gain prominence. These systems leveraged the ability of RNNs to model the sequential dependencies in natural language. Prominent RNN-based QA systems like BiDAF and DCN made significant progress in extracting answers to questions from passages or documents. What sets these systems apart is that they can be trained in an end-to-end manner, using a set of triples consisting of a passage, a question, and an answer. This eliminates the need for extensive feature engineering.
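As an illustration of what such a training triple can look like, the sketch below stores a passage, a question, and an answer, and derives the start and end token positions of the answer within the passage. Extractive models are commonly trained to predict a span like this. The class name, the whitespace tokenisation, and the example are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class QAExample:
    """One (passage, question, answer) training triple."""
    passage: str
    question: str
    answer: str

def answer_span(example: QAExample) -> Tuple[int, int]:
    """Return inclusive (start, end) token indices of the answer in the passage.

    Uses naive whitespace tokenisation; real systems use proper tokenisers.
    """
    tokens = example.passage.split()
    answer_tokens = example.answer.split()
    if not answer_tokens:
        raise ValueError("answer must not be empty")
    n = len(answer_tokens)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == answer_tokens:
            return start, start + n - 1
    raise ValueError("answer does not occur in passage")

example = QAExample(
    passage="IBM Watson was built on the DeepQA architecture",
    question="What architecture was IBM Watson built on?",
    answer="DeepQA",
)
print(answer_span(example))  # (6, 6)
```

During end-to-end training, the model receives the passage and question as input and learns to predict the span directly, with no hand-crafted features in between.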

Language Model-Based QA

In the world of AI, more specifically deep learning, a special type of architecture called the Transformer was introduced in 2017. Transformers are designed to process sequential data, such as text, by paying attention to different parts of the input sequence. This attention mechanism allows the model to capture complex relationships and dependencies within the data, leading to improved performance on various NLP tasks.
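The core of the attention mechanism, scaled dot-product attention, can be sketched in a few lines of plain Python. This is a minimal single-head illustration with hypothetical function names. It leaves out the learned projections, multiple heads, and masking that a full Transformer layer uses.

```python
import math
from typing import List

Vector = List[float]

def softmax(scores: Vector) -> Vector:
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Two identical keys receive equal weight, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]]))
```

Each output position is a weighted mix of all value vectors, which is how the model relates a word to every other word in the sequence, however far apart they are.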

Based on the Transformer architecture, a group of pre-trained language models gained popularity. These models are trained on a vast amount of unlabelled data, enabling them to learn the underlying patterns and structures of language. By leveraging this pre-training, these models can achieve impressive results on domain-specific tasks using tailoring strategies such as fine-tuning.

Some notable examples of pre-trained language models include GPT-1, BERT, and GPT-2. These models were all trained on massive numbers of tokens, allowing them to grasp the nuances of language. GPT-1, for instance, can excel in question-answering tasks when fine-tuned on a limited amount of labelled data. BERT, on the other hand, can be customised into a powerful question-answering model by adding an output layer during the fine-tuning process.

However, it was GPT-2 that truly pushed the boundaries of language models. It demonstrated zero-shot task transfer, in which the model applies its knowledge to a new NLP task, such as question answering, without any task-specific fine-tuning. This groundbreaking approach opened up new possibilities for language models to tackle task-specific problems such as QA without any explicit supervision. It paved the way for the large language models (LLMs) that followed.

Large Language Model-Based QA

As language models have continued to advance, their scale has grown along with their generative capability. These LLMs have shown a remarkable ability to generate human-like outputs such as text, code, images, audio and video. Typical examples are GPT-3 and its later versions such as text-davinci-003. Generative AI backed by these LLMs has become widely used and recognised.

The year 2022 saw a groundbreaking development in the field of generative AI with the emergence of ChatGPT. It has had an awe-inspiring impact on the public's understanding of what AI can do. This advanced language model has revolutionised the way we interact with technology and has opened up new possibilities for human-machine communication. With its ability to generate human-like responses, ChatGPT has not only surpassed previous language models but has also pushed the boundaries of what we thought was possible.

One of the most thrilling features of ChatGPT is its remarkable ability to answer questions. Given just a simple query, ChatGPT can often provide accurate and informative responses. This is particularly effective for open-domain questions, where the vast amount of data it was pre-trained on allows it to give comprehensive responses. However, when it comes to specific domains, such as a client's own data, ChatGPT struggles to answer accurately. It often gives hallucinated answers, or simply admits that it does not know. This limitation arises from the model's lack of knowledge in that particular domain.

To address this limitation, an approach called retrieval augmented generation (RAG) has been introduced. RAG is used to do QA over specific knowledge sources, which can include documents in various formats such as text and PDF, websites, or a combination of these. RAG combines the power of LLMs with an information retrieval mechanism that accesses domain-specific knowledge sources. By incorporating this retrieval step, LLMs can ground their answers in the given domain. This makes it easier for users to obtain reliable information on a wide range of subjects.
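The retrieve-then-generate flow can be sketched as follows. This is a minimal illustration under several assumptions: simple word overlap stands in for the embedding-based retrieval that RAG systems typically use, the `generate` parameter is a hypothetical stand-in for whatever LLM call is available, and the example chunks are invented.

```python
import re
from typing import Callable, List

def tokenize(text: str) -> set:
    """Lower-cased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Rank knowledge-source chunks by word overlap with the question."""
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, context: List[str]) -> str:
    """Place the retrieved chunks in the prompt ahead of the question."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say you do not know.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"
    )

def rag_answer(question: str, chunks: List[str],
               generate: Callable[[str], str], top_k: int = 2) -> str:
    """Retrieve relevant chunks, build a grounded prompt, and call the LLM."""
    context = retrieve(question, chunks, top_k)
    return generate(build_prompt(question, context))

# Invented knowledge-source chunks, for illustration only.
chunks = [
    "Refunds are processed within 14 days.",
    "Our office is closed on public holidays.",
]
# A stand-in "LLM" that just echoes the prompt, so the sketch runs offline.
print(rag_answer("How long do refunds take?", chunks, generate=lambda p: p, top_k=1))
```

Because the model is instructed to answer from the retrieved context, its response is tied to the client's own documents rather than to whatever it happened to see during pre-training.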


In summary, the progress of question answering in the last sixty years has been truly remarkable. From early rule-based systems to sophisticated generative AI based approaches, technology in QA has come a long way. The advancements in AI technology have transformed the way we interact with information. As we continue to push the boundaries of AI, the future of question answering holds even more exciting possibilities.

If this article catches your attention and you would like to use it or part of it in your own work, we kindly ask that you give credit where credit is due. Citing our blog not only acknowledges our hard work, but it also helps others find the original source of the information.
