LLM data masking fails, raising security concerns for organizations.
by Dickey Singh

Typical LLM Usage

Consider a real-world scenario where a Customer Success Manager (CSM) interacts with a Large Language Model (LLM) to draft an email. Here is her partial prompt.

Write an empathetic and remorseful email to Mark McLimea at Amazon regarding a critical issue.

Acknowledge a $1.2M discrepancy caused by a synchronization failure that has impacted 36% of accounts. ...

The LLM did an excellent job with this response:

Subject: CONFIDENTIAL: $1.2M Discrepancy—Action Required
Dear Mark McLimea,
We’ve identified a $1.2M discrepancy in your account due to a synchronization failure, affecting over 35% of accounts.
A fix is scheduled for Monday 2:00–4:00 AM PST. Please review your transactions for the last 5 weeks, and confirm if this timing works.
We deeply regret this issue and are committed to resolving it quickly.
Best regards,
Mary Carey
Customer Success Manager
Acme Finance

LLM Data Leaks

Sharing data with external Large Language Models (LLMs) such as OpenAI's GPT models, Anthropic's Claude, and Google's Gemini raises substantial security concerns for your business, particularly when the data you share includes your own and your customers' information.

LLM Data Masking

Most AI products rely on data masking, so a prompt like

Write an empathetic and remorseful email from Mary Carey, a customer success manager, to Mark McLimea at Amazon regarding a critical issue....

would be changed to the following before it is sent to an LLM:

Write an empathetic and remorseful email from T12342, a customer success manager, to U34525 at C34342 regarding a critical issue....
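
For context, the substitution itself is usually a simple dictionary lookup. The sketch below, using the illustrative names and placeholder tokens from the example above, shows the general idea; it is not Cast's or any specific vendor's actual implementation.

# Minimal sketch of the dictionary-based substitution most masking layers
# perform. Names and placeholder tokens mirror the example above; the code
# is illustrative, not any vendor's real implementation.

MASK_MAP = {
    "Mary Carey": "T12342",     # employee
    "Mark McLimea": "U34525",   # customer contact
    "Amazon": "C34342",         # customer account
}

def mask(text: str) -> str:
    """Replace each known sensitive value with its opaque placeholder."""
    for value, token in MASK_MAP.items():
        text = text.replace(value, token)
    return text

def unmask(text: str) -> str:
    """Reverse the substitution on the LLM's response before showing it."""
    for value, token in MASK_MAP.items():
        text = text.replace(token, value)
    return text

prompt = ("Write an empathetic and remorseful email from Mary Carey, "
          "a customer success manager, to Mark McLimea at Amazon "
          "regarding a critical issue.")
print(mask(prompt))
# Write an empathetic and remorseful email from T12342, a customer
# success manager, to U34525 at C34342 regarding a critical issue.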

LLM Data Masking does not work

We ran several experiments in which we asked ChatGPT and Claude to guess a customer from data-masked and anonymized content.
Both were able to accurately reverse engineer a customer name from a finite list of customers between 89% and 94% of the time. We then asked each model to generate a Python script, and with a little back and forth we had a script we could reuse going forward.
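
As a rough illustration of the experiment, a prompt like the one below can be sent to a model through the OpenAI Python SDK. The model name, candidate list, and wording are assumptions for illustration, not our exact test setup; the Claude equivalent behaves similarly.

# Rough reproduction of the re-identification test. Illustrative only:
# assumes the OpenAI Python SDK (v1+) with OPENAI_API_KEY set; the model
# name, customer list, and wording are stand-ins.
from openai import OpenAI

client = OpenAI()

masked_content = (
    "Write an empathetic and remorseful email from T12342, a customer "
    "success manager, to U34525 at C34342 regarding a critical issue. "
    "Acknowledge a $1.2M discrepancy caused by a synchronization failure "
    "that has impacted 36% of accounts."
)

# A finite list of candidate customers, as in the experiments above.
customers = ["Amazon", "Walmart", "HPE", "Target", "Costco"]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "The following prompt was anonymized with placeholder tokens:\n\n"
            f"{masked_content}\n\n"
            "Which of these customers is C34342 most likely to be, and why? "
            f"Candidates: {', '.join(customers)}"
        ),
    }],
)
print(response.choices[0].message.content)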

LLM Data Masking does not work.
Salesforce Agentforce acknowledged the problem in this post.

Salesforce’s documentation highlights potential data leakage issues within its Agentforce platform. The company acknowledges that “no model can guarantee 100% accuracy” in data masking, and that “cross-region and multicountry use cases can affect the ability to detect specific data patterns.”  This indicates that sensitive information may inadvertently be exposed when using Large Language Models (LLMs) within the platform.

Protecting your proprietary business information, such as your vision, unreleased products, sensitive sales projections, and strategic growth plans, as well as customer data such as usage, adoption, and ROI analytics, is of utmost importance. A data leak can cost more than money: it can damage a company's reputation, hand competitors an edge, and erode customer trust.

See 24 Essential Security Concerns of Sharing Data with LLMs — And How Customer-Facing Leaders Solve Them.

Steps Cast Uses to Protect Your and Your Customers' Data

Beyond the basic mitigation strategies you could apply on your own, at Cast we understand the significant risks posed by sharing sensitive company data with external LLMs. Our solution is designed to keep your data secure while enabling you to leverage the full power of AI.

Reverse Engineering Problems with Data Masking

Since masked and anonymized data (e.g., HPE becomes Customer2324) shared with LLMs can easily be reverse engineered, we developed a unique and effective solution that removes the need for masking and anonymization altogether.

We ran several experiments asking ChatGPT and Claude to guess a customer from data-masked content. Both were able to accurately reverse engineer a customer name from a finite list of customers between 89% and 96% of the time.

Both even generated a Python script that could be further enhanced if we chose to do so.
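
The scripts the models produced amounted to scoring each candidate customer against the contextual clues that masking leaves behind. The sketch below is a hypothetical, simplified stand-in, not the models' actual output; the candidate profiles, clue terms, and masked text are invented for illustration.

# Hypothetical, simplified stand-in for the kind of re-identification
# script the models produced: score each candidate customer against the
# contextual clues that survive masking, then return the best match.
import re

# Illustrative background knowledge a model already has about each
# candidate; none of these names appeared in the masked prompt.
CANDIDATE_PROFILES = {
    "Amazon":  {"e-commerce", "marketplace", "synchronization", "aws"},
    "HPE":     {"hardware", "servers", "enterprise", "greenlake"},
    "Walmart": {"retail", "stores", "grocery", "point-of-sale"},
}

def guess_customer(masked_text: str, profiles: dict[str, set[str]]) -> str:
    """Return the candidate whose profile overlaps the masked text most."""
    words = set(re.findall(r"[a-z0-9\-]+", masked_text.lower()))
    scores = {name: len(terms & words) for name, terms in profiles.items()}
    return max(scores, key=scores.get)

masked_text = (
    "email to U34525 at C34342, a large e-commerce marketplace, about a "
    "$1.2M discrepancy caused by a synchronization failure"
)
print(guess_customer(masked_text, CANDIDATE_PROFILES))  # -> Amazon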

Cast Solution

Because Cast internalizes the information it learns from all of your customer-facing data sources and products, rather than always exposing a RAG interface, the Cast solution is much safer. This matters to any business with competitors: it eliminates the risk of exposing sensitive information to third-party providers, protects your competitive advantage, and keeps you compliant with data protection regulations.

Sharing some content is unavoidable when accessing an external LLM's specialized capabilities. However, Cast's approach does not require explicit masking, unmasking, or obfuscation of your data, your customers' data, or personally identifiable information (PII). This ensures that even if data is intercepted or improperly used, sensitive information remains protected.

AI Agents That Work Without Raw Data Sharing

Cast AI Agents leverage secure data pipelines that allow them to perform complex tasks—like generating personalized insights, onboarding presentations, and scaling customer success—without requiring raw data to be shared externally.

Key Takeaway

Why this matters: Cast helps businesses achieve high ROI from AI without sacrificing security.

By eliminating data-sharing risks and keeping sensitive information out of external LLMs, Cast empowers companies to securely harness the potential of AI without exposing themselves to the 24 critical risks highlighted in the post linked above. Cast stands as your reliable partner in scaling revenue while safeguarding your data.

Listen to Kirsten, a Cast customer, explain to a HubSpot executive how she solves for FOBU with Cast.app.


Kirsten DiChiappari,
Vice President Customer Success & Experience
vCom Solutions

Ready to automate your success too?