

Out-of-the-Box vs. Customized GenAI Bots: Part 1

Data ownership and security

By Crispin Reedy
June 2, 2025
Artificial Intelligence
7 minute read

Until the emergence of generative AI, online text chatbots (and their voice-enabled counterparts) were notoriously ineffective. They struggled to handle complex tasks, their responses were scripted, and you couldn’t easily personalize them, leading many customers to prefer to interact with a human rather than a chatbot.

Fast forward to today: chat-enabled large language models (LLMs) like OpenAI’s ChatGPT have redefined the landscape, proving to the world that chatbots don’t have to be frustratingly limited. In fact, they can deliver highly impressive, customizable user experiences.

Enterprises are starting to embrace these advancements, but with caution. While generative AI offers tremendous business value, it also carries serious business risks: leaking sensitive information, concerns around data security and accuracy (such as “hallucinations”), and a lack of regulation. These foundational risks persist, making the legal and operational landscape around generative AI uncertain for many enterprises.

In this series, we’ll dive into these concerns and discuss how custom language models may provide a safer and more effective approach moving forward. 

Part 1: Data ownership and security

The most important thing to understand about intelligent agents is that they’re built on language models: applications designed to understand and generate text like a human would. These models learn how words and sentences typically work together by analyzing example data, which they then use to generate responses.

The data powering these language models makes all the difference in the intelligence and capabilities of the chatbot. This makes data ownership and security a major priority for any enterprise considering a chatbot for their business, especially if their operations involve proprietary information or are subject to regulatory compliance. Using off-the-shelf language models means trusting that the provider has properly trained the model using data that is secure for your business.

It’s also worth noting that modern chatbots are increasingly becoming “multimodal,” meaning they can handle not only text but also voice, images, and even videos, making them more versatile. For the purpose of this article, when we refer to “chatbots” or “intelligent agents,” we’re primarily talking about text- and voice-enabled chatbots.

Chatbots and language models

Traditional text- and voice-operated chatbots are built on rule-based language models. These bots follow predefined rules and never “learn” or adapt from data. The underlying datasets are just big enough to cover the bots’ use cases and nothing more. They’re easy to build and don’t typically leak anything (unless they are hard-coded with sensitive information and someone finds an exploit).
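To make the contrast concrete, here is a minimal sketch of the rule-based approach described above. The keywords and replies are hypothetical, but the key property is real: the bot only matches predefined rules and never learns from data.

```python
# A minimal rule-based chatbot sketch. The rules below are illustrative;
# real deployments map many phrasings to each intent, but the bot still
# never adapts beyond what its authors hard-coded.
RULES = {
    "hours": "We are open 9am-5pm, Monday through Friday.",
    "refund": "To request a refund, reply with your order number.",
}
FALLBACK = "Sorry, I didn't understand. Could you rephrase?"

def respond(message: str) -> str:
    text = message.lower()
    for keyword, reply in RULES.items():
        if keyword in text:
            return reply
    return FALLBACK  # anything outside the rules falls through

print(respond("What are your hours?"))  # We are open 9am-5pm, Monday through Friday.
print(respond("Tell me a joke"))        # Sorry, I didn't understand. Could you rephrase?
```

Because every behavior is enumerated up front, such a bot can’t leak anything it wasn’t explicitly given, which is exactly the trade-off the paragraph above describes.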

Conversational generative AI products, on the other hand, are built on machine learning models that are trained on much larger datasets. The size and scope of these models can vary widely:

  1. Small models: trained on datasets ranging from a few gigabytes (GB) to tens of gigabytes
  2. Medium models: trained on datasets of a few hundred gigabytes
  3. Large models (like ChatGPT and Google Bard): trained on datasets ranging from hundreds of gigabytes to over a terabyte (TB)

To put it in perspective, 1GB of text is equivalent to around 1,000 500-page books, while 1TB is about 6.5 football fields of 500-page books stacked up. 
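The book comparison can be sanity-checked with back-of-the-envelope arithmetic. The per-page figure below is an assumption (roughly 2,000 characters of plain text per page, one byte per character), not a number from the article:

```python
# Rough sanity check of the "1 GB of text ≈ 1,000 books" comparison.
# Assumption: a 500-page book holds ~2,000 characters per page,
# stored as ~1 byte per character of plain text.
CHARS_PER_PAGE = 2_000
PAGES_PER_BOOK = 500
BYTES_PER_BOOK = CHARS_PER_PAGE * PAGES_PER_BOOK  # ~1 MB per book

GB = 10**9
TB = 10**12

books_per_gb = GB // BYTES_PER_BOOK
books_per_tb = TB // BYTES_PER_BOOK

print(books_per_gb)  # 1000
print(books_per_tb)  # 1000000
```

Under those assumptions, a terabyte-scale training set really does correspond to about a million 500-page books.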

The larger the model, the more sophisticated its responses, but there’s a trade-off. Larger models are more resource-intensive and expensive to operate. For example, it’s estimated that ChatGPT costs over $700,000 per day to run, and its dataset is over 45TB in size.

Risks of sharing model data also emerge, especially if a generative AI bot is not carefully programmed. Any enterprise deploying a conversational AI system must be mindful of the data behind the system to ensure privacy and security.

Categories of language model data

There are three main types of data in language models:

  1. Input data: Used to create the model
  2. Output data: Generated by the model in response to user inputs
  3. Usage data: Created by users interacting with the model

Input data

Input data is the beating heart of all types of natural language generation, including generative AI. The better the input data, the more accurate and relevant the output will be.

With conversational technologies, “data quality” means relevance and context. For instance, if an Apple customer service bot encounters the phrase "I’m having a problem with my vision," it must interpret this correctly as an issue with Apple’s Vision Pro product—not the customer’s eyesight. The more specific the input data, the more accurately the chatbot will interpret and respond.

The best way to handle a situation like this is to provide the language model with custom terminology and other metadata so that it can correctly and helpfully respond to product- or service-specific inquiries in contact center scenarios. This is a matter of appropriately training the model with the right input data.
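One lightweight way to supply that custom terminology, short of retraining a model, is to inject a glossary into the system prompt sent with every request. The sketch below just builds the prompt string; the glossary entries and wording are hypothetical, and a real integration would pass the result to whichever model API the business uses:

```python
# Sketch: injecting domain terminology into a system prompt so that
# "vision" resolves to the product, not the customer's eyesight.
# Glossary entries below are hypothetical examples.
GLOSSARY = {
    "Vision": "Apple Vision Pro, a spatial-computing headset",
    "AirTag": "a small Bluetooth item tracker",
}

def build_system_prompt(glossary: dict) -> str:
    terms = "\n".join(f"- {term}: {meaning}" for term, meaning in glossary.items())
    return (
        "You are a customer-support agent. When customers use the terms "
        "below, interpret them as product names, not everyday words:\n"
        + terms
    )

print(build_system_prompt(GLOSSARY))
```

The same idea scales up: the richer and more domain-specific the metadata you attach, the less the model falls back on generic interpretations.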

Risks of poor input data: "Garbage in, Garbage out"

The old saying “garbage in, garbage out” applies especially to input data. If the model isn’t trained to recognize and interpret domain-appropriate language, it is more likely to generate output that frustrates end users with inaccurate responses. Appropriate data allows the model to learn patterns, understand context, and respond intelligently.

Off-the-shelf AI systems like ChatGPT scrape an enormous amount of data from the internet, raising legal and ethical concerns about how that data is sourced. This leads to questions about data transparency, copyright infringement, and even plagiarism. Users can’t easily verify what data models like ChatGPT were trained on, which can pose risks if that data is not ethically sourced.

The larger the model, the larger the training set

The size of the training data correlates with the model’s functionality. Foundational models like ChatGPT and Bard are designed for general-purpose use—they can discuss a wide range of topics but aren’t specialized for particular business needs. While excellent for general-purpose scenarios, foundational models are less appropriate for business-specific applications.

Using a foundational LLM in a business setting is like using a sledgehammer to crack a walnut. For tailored business applications, it's critical to focus on specific, domain-relevant training data to ensure the model’s responses align with the business’s needs.

Output data

Output data is what the chatbot generates in response to user input. Like input data, output data must be tailored to your business context and meet standards for safety and accuracy. A chatbot’s responses directly impact your brand's reputation; inaccurate or off-brand replies can harm customer trust.

Consistency is key, especially when addressing repeated inquiries. Just as you can instruct ChatGPT to emulate a specific writing style, you should be able to direct your chatbot to maintain a consistent tone and messaging that aligns with your company’s voice, ensuring a positive customer experience every time.

Usage data

As your customers interact with AI systems, they generate personal data like account numbers, PINs, and credit card details. It’s vital to ensure that data storage and usage comply with regulations like PCI-DSS, HIPAA, and GDPR.
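One small, concrete piece of that compliance work is masking card numbers before transcripts are stored or logged. The sketch below shows the idea with a simple regex; the pattern and masking rule are illustrative only, and a production system would use vetted tooling and cover many more identifier types:

```python
import re

# Sketch: masking card-number-like digit runs in a transcript before
# storage, keeping only the last four digits (a common PCI-DSS-style
# display rule). The regex here is illustrative, not production-grade.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def redact(transcript: str) -> str:
    def mask(m: re.Match) -> str:
        digits = re.sub(r"\D", "", m.group())
        return "*" * (len(digits) - 4) + digits[-4:]
    return CARD_RE.sub(mask, transcript)

print(redact("My card is 4111 1111 1111 1111, please update it."))
# My card is ************1111, please update it.
```

Redacting at ingestion time means sensitive values never reach logs or training sets in the first place, which is far easier to audit than trying to scrub them later.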

When selecting a language model vendor, evaluate how they handle data security and compliance. Off-the-shelf models may not guarantee the level of privacy your business requires, but with custom solutions, you can ensure your data is processed in a way that meets legal standards. Your security is only as strong as your weakest vendor.

Custom language models are the way

Custom language models are tailored to understand and generate text specific to your business, industry, or application. Unlike generalized models that cover a broad spectrum of topics, custom models are fine-tuned to address your unique use case. This makes them better suited to support your customers who are using your proprietary products and services.

By building a custom AI solution, you gain greater control over data access, handling, and integration. This can help you meet regulatory requirements while ensuring that your AI is more accurate and aligned with your business needs. Off-the-shelf solutions may not provide this level of control, and migrating to new systems just to work with a specific vendor could create more problems than it solves.

Custom solutions offer maximum ownership of your input, output, and usage data. With this control, you can ensure a higher degree of accuracy, align the chatbot with your specific terminology, and integrate it seamlessly with other business processes.
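One common route to a custom model is fine-tuning a base model on your own example conversations. Many fine-tuning services accept JSONL files of chat transcripts shaped roughly like the sketch below; the exact schema varies by vendor, and the company name and dialogue here are invented for illustration:

```python
import json

# Sketch: packaging domain-specific support conversations as JSONL
# training examples. The schema (a "messages" list of role/content
# pairs) is an assumption; check your vendor's fine-tuning docs.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are an Acme support agent."},
            {"role": "user", "content": "My widget won't sync."},
            {"role": "assistant",
             "content": "Let's re-pair it: hold the side button for 5 seconds."},
        ]
    },
]

jsonl = "\n".join(json.dumps(ex) for ex in examples)
print(len(jsonl.splitlines()))  # 1
```

Because you author every training example, you control exactly what data the model sees, which is the ownership advantage this section describes.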

In part 2, we’ll explore how to measure the effectiveness of intelligent agents built with custom language models and how continuous improvements can drive success in contact center operations. The role of AI in transforming contact centers is growing, and we’ll discuss how AI-backed technology is reshaping the metrics used to measure success.