Before diving into the phenomenon of emergence in Large Language Models (LLMs) like ChatGPT, there are a few key concepts to establish: complexity, complex systems, and chaotic behavior. These three concepts will be important when the discussion opens up to engineering concerns later on.
Complexity is perhaps the most important of the three concepts for LLMs. Complexity refers to the state of being composed of interconnected parts or elements that exhibit diverse behaviors and properties. Complex systems are systems made up of many interacting components, where the behavior of the system as a whole emerges from the interactions between those components. This can lead to emergent properties that are not present in any individual component but arise from the collective behavior of the system. An LLM's behavior emerges to the degree that there are interactions between its components, with more interactions producing more complex emergent behavior.
Complex systems can also exhibit chaotic behavior, where small changes in the initial conditions of the system lead to drastically different outcomes. Chaos in complex systems is not necessarily a lack of order, but rather a highly complex and unpredictable form of order that is difficult to understand or predict. We may know the layer-by-layer structure of the GPT Transformer network, but the astronomically high number of possible node connections effectively makes the output unpredictable.
Large Language Models like ChatGPT are complex systems, made up of many interconnected parts, including the vast amounts of text data used for training, the model architecture, and the training process itself. As the size and complexity of LLMs continue to increase, we can expect to see more emergent properties and chaotic behavior in their output. A common misunderstanding is that chaos is random. In mathematics and mechanics, chaos is apparently random or unpredictable behavior in systems governed by deterministic laws. While the multi-layered Transformer network, saturated with billions of trained parameters, is incredibly complex, it is still deterministic. This apparently unpredictable behavior of networks, such as those in LLMs, is a contributing factor to the unpredictability present in the phenomenon of emergence. Understanding emergence in LLMs is key to unlocking the full potential of these powerful language models, as well as handling the fuzziness inherent in their use.
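To make "deterministic yet unpredictable" concrete, consider the logistic map, a classic toy example of deterministic chaos (not an LLM, just an analogy). The update rule is fully specified, yet two nearly identical starting points diverge quickly. A minimal sketch in Python:

```python
# The logistic map x_{n+1} = r * x_n * (1 - x_n) is fully deterministic,
# yet at r = 4.0 it is chaotic: tiny differences in the starting point
# grow exponentially, making long-run behavior practically unpredictable.

def logistic_trajectory(x0: float, r: float = 4.0, steps: int = 40) -> list[float]:
    """Iterate the logistic map from x0 for a fixed number of steps."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-7)  # perturb the initial condition slightly

for n in (0, 10, 20, 30, 40):
    print(f"step {n:2d}: |a - b| = {abs(a[n] - b[n]):.7f}")
# The gap grows from 1e-7 toward order 1: deterministic laws, unpredictable output.
```

The analogy to LLMs is loose but useful: the forward pass is governed by deterministic rules, yet the sheer scale of interactions makes the output effectively unpredictable from inspection of the parts.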
With a working understanding of complexity in large systems, we can now take a closer look at the emergence of novel behavior in LLMs such as ChatGPT, and in large generative models in general. Emergence is a phenomenon that is not fully understood, and it is a superpower of LLMs trained on high volumes of data.
At a fundamental level, complexity gives rise to the emergence of features in LLMs. As the model learns from increasingly large and diverse datasets, it develops a more nuanced understanding of language, including grammar, syntax, and semantics. This knowledge is distributed across the many layers and neurons of the model, resulting in emergent properties that are not easily attributable to any individual component.
One example of emergence in LLMs is their ability to complete sentences or paragraphs in a coherent and contextually appropriate manner. By analyzing the context and the partial sentence or paragraph provided, the model is able to generate text that is consistent with the language used in the input. Another example is the ability of LLMs to translate text from one language to another, even if the model has not been explicitly trained on that language pair. This emerges from the model’s ability to learn the underlying structures and patterns of language, rather than simply memorizing specific translations. These behaviors are absent in models trained on small amounts of data. And fascinatingly, there are discrete inflection points where the model transitions from not having the capability for a certain behavior to possessing it.
Overall, emergence in LLMs is a powerful property that enables these models to generate highly realistic and contextually appropriate text. As the size and complexity of LLMs continue to increase, we can expect to see even more impressive emergent behavior in the future.
So what are some examples of LLM behaviors that have emerged as model scale, both parameter count and training data volume, increases? A minimal prompting sketch follows the list below.
Language Translation: LLMs can translate text from one language to another, even if the model has not been explicitly trained on that language pair. As model scale increases, LLMs learn to generate more accurate and fluent translations.
Summarization: LLMs can summarize large amounts of text into shorter, more concise summaries. As model scale increases, LLMs generate more accurate and informative summaries.
Question Answering: LLMs can answer questions based on a given context, such as a paragraph of text. As model scale increases, LLMs generate more accurate and relevant answers to a wider range of questions.
Sentiment Analysis: LLMs can analyze the sentiment of a given text, such as determining whether a review is positive or negative. As model scale increases, LLMs produce more nuanced and accurate sentiment analysis.
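What is striking from an engineering standpoint is that all four behaviors above are reached through the same generic text-completion interface. Here is a minimal sketch, assuming the openai Python package (v1+) and an API key in the environment; the model name and prompts are purely illustrative:

```python
# A minimal sketch: four "emergent" tasks, one generic completion interface.
# Assumes OPENAI_API_KEY is set; the model name is an assumption, not a
# recommendation, and any sufficiently large chat model makes the same point.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    """Send a single user prompt and return the model's reply text."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

article = (
    "Large language models learn statistical structure from text. "
    "At sufficient scale, qualitatively new capabilities appear."
)

# Translation, summarization, QA, and sentiment analysis all flow through
# the same completion call, with no task-specific training or model heads.
print(ask("Translate to French: 'The weather is lovely today.'"))
print(ask(f"Summarize in one sentence:\n{article}"))
print(ask(f"Based only on this text, what appears at sufficient scale?\n{article}"))
print(ask("Is this review positive or negative? 'The battery died in an hour.'"))
```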
A more fun example comes from the paper Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Prior expectations were of a gradual, steady increase in the ability of LLMs to identify a movie from a set of emojis. Instead, what researchers found in June 2022 was that some behaviors developed linearly as model scale grew, while others grew non-linearly and had “breakthroughs” at certain scales.
Figure: LLM emergence of emoji-to-movie prediction behavior (source: arXiv).
The above example comes from asking LLMs of various parameter counts the question: “What movie does this emoji describe? 👧🏻🐟🐠🐡”. And for reference, 0-shot means the model is given only the question itself, with no worked examples or additional context in the prompt.
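To make the prompting regimes concrete, here is a small illustrative sketch of the difference; the few-shot examples are hypothetical, not taken from the paper:

```python
# 0-shot: the task is posed once, with no worked examples in the prompt.
zero_shot = "What movie does this emoji describe? 👧🏻🐟🐠🐡"

# Few-shot (for contrast): a handful of solved examples precede the query,
# giving the model in-context demonstrations of the task format.
few_shot = (
    "What movie does this emoji describe? 🦁👑 -> The Lion King\n"
    "What movie does this emoji describe? 🚢🧊💔 -> Titanic\n"
    "What movie does this emoji describe? 👧🏻🐟🐠🐡 ->"
)
```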
Between effective parameter counts (a measure of model scale) of 10^10 and 10^11, a clear inflection or breakthrough occurs. This example demonstrates the non-linear emergence of a behavior that, below the 10^10 parameter scale, was thought to be not effectively supported by this model and approach. This can be a blessing and a curse for large generative systems; more on AI ethics, risks, and concerns in another post. In the next section we’ll ground the discussion in an engineering context and ask how to build maintainable, scalable, deterministic systems with LLMs and generative models more generally.
As we’ve seen, complex and chaotic systems can pose challenges for software engineers working with Large Language Models (LLMs) like ChatGPT and large generative models in general. One important consideration is how LLMs can interface with more predictable and deterministic software systems, such as those used for content management, user authentication, or payment processing. As LLMs gain ever more complex behaviors, and AI tools are increasingly integrated into the SaaS and application ecosystem, being able to fully and safely leverage these tools will be essential for success.
One approach to addressing this challenge is the use of ChatGPT plugins. Plugins are software components that can be added to an existing system to provide additional functionality or to solve specific problems. For example, a plugin could be used to interface with Wolfram for computation, or to integrate with third-party services for content moderation or user authentication. The Wolfram plugin is a remarkable example that allows LLMs such as ChatGPT to convert a subset of user prompts requesting calculations into the Wolfram Language for consistent, accurate computations. In the link above, the MLST podcast discusses how the Wolfram plugin effectively integrates with LLMs, allowing complex and current computation queries to be made. The synergy is that LLMs also abstract away the complexity of the Wolfram Language, letting users work in natural language.
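A rough sketch of the routing idea follows. In the real plugin system the model itself decides when to invoke a tool; here a simple keyword router stands in, and query_wolfram and query_llm are hypothetical placeholders rather than real APIs:

```python
# A sketch of the plugin routing idea: send computational prompts to a
# deterministic engine instead of letting the LLM improvise arithmetic.
# query_wolfram and query_llm are hypothetical stand-ins, not real APIs.
import re

COMPUTE_PATTERN = re.compile(r"\b(calculate|compute|integrate|solve)\b", re.IGNORECASE)

def query_wolfram(expression: str) -> str:
    # Hypothetical: forward the expression to a Wolfram-backed service
    # for a deterministic, reproducible result.
    return f"[deterministic result for: {expression}]"

def query_llm(prompt: str) -> str:
    # Hypothetical: forward the prompt to a hosted language model.
    return f"[generated answer for: {prompt}]"

def answer(prompt: str) -> str:
    """Route computational prompts to the deterministic engine, the rest to the LLM."""
    if COMPUTE_PATTERN.search(prompt):
        return query_wolfram(prompt)
    return query_llm(prompt)

print(answer("Calculate the integral of x^2 from 0 to 3"))  # routed to Wolfram
print(answer("Write a haiku about emergence"))              # routed to the LLM
```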
The emergence of new behaviors in LLMs can also pose challenges for ensuring a consistent user experience, content security, and AI safety. For example, LLMs may generate text that is inappropriate or offensive, or that exposes sensitive information. These issues can be difficult to predict and mitigate, and may require new approaches to content moderation and user feedback. One key will be thoroughly categorizing the various use cases for LLM products and services and maintaining a strategy for each bucket or category of use case. Further, it makes sense to abstract the LLM and give users an interface that allows SaaS products and apps to expose specific surfaces for LLM use. The exposed surfaces can be more effectively managed, risk-mitigated, and maintained. Below is a basic example, where the LLM service is abstracted from the application using Azure OpenAI, and the backend or middleware has the opportunity to validate, check, filter, and otherwise perform quality control on the customer-facing LLM output. Collectively these strategies can increase quality for users, simplify application design with a modular approach, and provide layers for various quality assurance checks.
Figure: Modular LLM architecture overview with Azure OpenAI service.
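As a minimal sketch of the middleware layer in the diagram, assuming the openai Python package's AzureOpenAI client; the endpoint, deployment name, API version, and toy block-list are illustrative stand-ins for a real moderation pipeline:

```python
# A minimal sketch of the middleware layer: the client app never calls the
# LLM directly. The backend calls Azure OpenAI, runs quality-control checks,
# and only then returns text to the customer. All names here are assumptions.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumption: match your deployment
)

BLOCKED_TERMS = {"ssn", "password"}  # toy filter; use a real moderation service

def validated_completion(prompt: str) -> str:
    """Call the LLM, then filter the output before it reaches the customer."""
    response = client.chat.completions.create(
        model="my-gpt-deployment",  # hypothetical Azure deployment name
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content or ""
    if any(term in text.lower() for term in BLOCKED_TERMS):
        return "Sorry, I can't include that in a response."
    return text
```

Keeping this validation in the backend, rather than in the client, is what makes the exposed surface manageable: every customer-facing output passes through one auditable choke point.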
Software engineering must continue to evolve in response to the challenges posed by generative AI and the emergence of new behaviors in LLMs. This may involve the introduction of plugins and other tools for abstracting and solving edge cases where ChatGPT cannot provide an adequate answer. It will also require new approaches to content moderation, user feedback, and AI safety, to ensure that LLMs can be used responsibly and effectively in a wide range of applications.