As generative AI tools continue to improve, developers need more data to train models with higher efficacy.
This need for more and more data is directly at odds with prioritizing data privacy. But we can safeguard sensitive data with self-deployed tools.
Here's a deep dive into the thorny data safety issues created by generative AI.
This time last year, if you had never heard of GPT-3 or DALL-E 2, you wouldn't necessarily have been at a disadvantage. Now it seems that if you don't have a ChatGPT tab open at all times, you're lagging behind.
Generative AI increases the efficiency of work across industries by automating rote tasks, accelerating upskilling, and allowing for the quicker development of proofs of concept. The development and improvement of these models, however, comes at a cost. Every time we ask ChatGPT for help we are exposing our data to an external source, risking our privacy.
The threat generative AI poses to data privacy
The key ingredients for the recent breakthroughs in generative AI are model architectures, enormous training datasets, and massive computational resources. The importance of high quality data for training these models means that foundation model providers are incentivized to draw on their customer data for training. Because of the potential of these models to emit sensitive information that is contained in their training data, everyone using gen AI is concerned about one thing: data privacy.
Below is a comparison of the accuracy of several LLMs based on their number of parameters and the amount of training data, measured in tokens. The color gradient represents model performance, with darker colors indicating higher performance.
This research from DeepMind shows that smaller models trained on larger datasets can outperform larger models with more parameters, such as GPT-3. In other words, data is the key to scaling model improvement. As a result, smaller open-source models with under 100B parameters, trained on increasingly large corpora of text, are becoming more and more common. Meta's Llama 2 models, for example, were trained on datasets of 2 trillion tokens.
Building the next generation of models will require even more data, whether for pre-training or fine-tuning. This demand for data could easily become fundamentally at odds with privacy and security requirements.
When it comes to actually using LLMs in your organization, it's important to be cautious. Your team could be unintentionally sending your data to third parties whose data security policies differ from yours, and thereby sharing confidential information in breach of regulations. For example, data submitted to OpenAI from 2020 through March 2023 was potentially used to train future models, such as ChatGPT.
The risk of a model disclosing information from its training set is proportional to the size of the model (the number of parameters) and the number of times a particular data point appeared in the training set. Research by Carlini et al. exposed severe training data leakage in diffusion models, as shown below.
Training data leakage is a huge liability for an organization. If sensitive data is used in training, it is at risk of being disclosed, plain and simple. Companies like Stability AI and Microsoft are currently facing the consequences, with lawsuits stemming from their models reproducing copyrighted content without proper credit or compensation to the original creators.
All is not lost, however, when it comes to preserving data privacy while also reaping the benefits of generative AI. There are tools that allow you to safely use your data with third-party models like ChatGPT and those from Stability AI. Beyond these third-party hosted models, there are other options for integrating LLMs in your organization.
Tools for using third-party hosted models safely
Generative AI itself can be used to protect data before requests are sent to third-party hosted models. LLMs can auto-redact information from a dataset. Recognizing when the context of a sentence makes a word sensitive was a limitation of previous auto-redaction methods. For example, take the following sentence:
“The President tested positive for COVID-19 again Saturday per a letter from presidential physician Dr. Kevin O’Connor.”
Previous techniques would only label "Saturday" and "Dr. Kevin O'Connor" as a time and a person, respectively. Clearly, however, "president" and "COVID-19" are sensitive aspects of the sentence as well. Because transformer models can detect context, they are able to correctly label these words as sensitive. Advancements in prompt engineering have also increased LLMs' value for this type of task, with research from Carnegie Mellon showing that careful few-shot prompting (providing the LLM with a few examples of how it should respond to your requests within the prompt) can generate consistent and accurate results.
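To make this concrete, here is a minimal sketch of few-shot prompted redaction using an open-source model served locally through Hugging Face's transformers library. The model id, prompt format, and redaction examples are illustrative assumptions, not the Carnegie Mellon setup; any instruction-tuned model you can host yourself could stand in.

```python
# Minimal sketch: few-shot prompted redaction with a locally hosted LLM.
# The model id below is a placeholder; swap in any instruction-tuned
# open-source checkpoint you have access to.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

FEW_SHOT_PROMPT = """Redact sensitive entities by replacing them with [REDACTED].

Text: Jane Doe was admitted to Mercy Hospital on June 3 with pneumonia.
Redacted: [REDACTED] was admitted to [REDACTED] on [REDACTED] with [REDACTED].

Text: The President tested positive for COVID-19 again Saturday per a letter from presidential physician Dr. Kevin O'Connor.
Redacted:"""

# Greedy decoding keeps the redaction output deterministic.
result = generator(FEW_SHOT_PROMPT, max_new_tokens=60, do_sample=False)
print(result[0]["generated_text"])
```

Because the redaction model runs in your own environment, the text being scrubbed never has to leave your infrastructure before the sanitized version is sent on to a third-party model.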
Deploying models on premises
Most major cloud providers are expected to offer ways to deploy large third-party models within your own VPC. Alternatively, training and hosting your own LLM on premises is becoming more accessible, as smaller fine-tuned models now outperform large general-purpose models on some tasks. Like the Chinchilla model, these models have fewer parameters but are trained on larger sets of more specialized data.
These models perform comparably to their larger, third-party hosted counterparts on common benchmarks. Stability AI's open-source model Stable Beluga 2 showed almost the same accuracy as the closed-source GPT-4, around 60%, on TruthfulQA, a benchmark that measures a model's propensity to reproduce falsehoods found on the internet. Current research suggests that GPT-4 has been producing less accurate results as the model is broadly fine-tuned. Locally tuned models, however, will not suffer this accuracy drift, since you control the fine-tuning yourself.
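As a rough illustration, here is a minimal sketch of loading an open-source checkpoint on your own hardware with the transformers library. The model id and prompt are assumptions for illustration; Stable Beluga 2 is a large model that needs substantial GPU memory, so a smaller open-source checkpoint can be substituted for experimentation.

```python
# Minimal sketch: running an open-source LLM entirely on your own hardware,
# so prompts and internal data never leave your environment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/StableBeluga2"  # illustrative; swap for any local model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Summarize our Q3 incident report without naming any customers."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Hosting the weights yourself also means you decide when and how the model is fine-tuned, which is what prevents the accuracy drift described above.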
How generative AI can be used to protect data privacy
We are seeing a growing trend of leveraging large, general-purpose models to generate synthetic data, which is then used to supplement real datasets when training more specialized models. Synthetic, or fake, data is artificially generated to mimic real information without using actual data points, preserving privacy and security.
A group at Stanford used the "self-instruct" method of model training: starting from a small seed of example data, they used GPT-3 to generate a large amount of task-based data and fine-tuned Meta's LLaMA model into a ChatGPT-like model for just $700, as opposed to the millions it took to train ChatGPT. This model, trained on synthetic data, was not only more cost-effective to train but also more secure to use.
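For illustration, below is a minimal sketch of self-instruct-style synthetic data generation, assuming the OpenAI Python client. The model name, seed tasks, and prompt wording are illustrative and not the exact Stanford pipeline; note that any seed examples sent to a third-party API should themselves be non-sensitive or synthetic.

```python
# Minimal sketch: growing a synthetic instruction-tuning dataset from a
# handful of seed tasks, in the spirit of self-instruct.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

seed_tasks = [
    {"instruction": "Classify the sentiment of a product review.",
     "input": "The battery died after two days.",
     "output": "Negative"},
    {"instruction": "Summarize a support ticket in one sentence.",
     "input": "Customer cannot reset their password after the latest update.",
     "output": "User is blocked from password reset after updating."},
]

prompt = (
    "You generate new training examples for instruction tuning. "
    "Here are seed examples:\n"
    + json.dumps(seed_tasks, indent=2)
    + "\n\nWrite 5 new, diverse examples in the same JSON format. "
    "Invent all names and details; do not reuse real data."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

The generated examples can then be used to fine-tune a smaller model you host yourself, keeping real customer data out of the training loop entirely.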
Further research has been done on training models with "distilled" training data generated by larger models. Data is distilled from LLMs by following certain prompting patterns, often involving chain-of-thought tasks, to produce more specific data for training smaller, more specialized models. These models show a steeper scaling curve: the more specialized, distilled data they are trained on, the better they perform.
Madelyn is a Product Marketing Associate at Tonic.ai.
Curious to learn more about the data privacy risks generative AI poses? Check out our upcoming webinar, Data Safety in the Age of AI, where Ander Steele, Head of AI at Tonic.ai, Anjana Harve, Seasoned Global Chief Digital & Information Officer, and Michael Rispin, General Counsel at Sprout Social, discuss how to manage these risks.