How Sidekick classifies sensitive data: the four-tier model for PII and PHI

Classification, automatically and continuously

During the first scan of an environment, and continuously as the estate evolves, Sidekick classifies every field by sensitivity. This is not a flagging exercise tacked on at the end. Classification is part of the initial discovery, keeps running after that, and feeds directly into how the platform handles access, masking, and audit.

The model is deliberately simple. Four tiers, each with a clear definition, and behaviour that follows automatically from the classification.

The core principle: Sensitivity classification is not something to bolt on after a data project is finished. It has to be embedded from the moment the data is discovered, or it will not hold up under audit. Sidekick treats classification as part of discovery, not as a separate workstream.

The four tiers

Every field discovered by Sidekick is classified into one of four levels.

| Tier | What it covers | Example fields |
| --- | --- | --- |
| Restricted | Personal information that directly identifies an individual. Subject to the strictest handling, including masking from users without explicit access. | Social security or national ID numbers, email addresses, full names, physical addresses, phone numbers |
| Confidential | Sensitive information that cannot be directly linked to an individual on its own, but is still subject to controls. | Diagnosis codes, internal financial records, contract values |
| Internal | Information that is internal to the business but does not carry personal or compliance sensitivity. | Pricing, product codes, internal reference numbers |
| Public | Information that is already available outside the organisation. | Store locations, public product names, published office addresses |
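
As a rough illustration of how the four tiers might be represented in code, the sketch below encodes them as an ordered enumeration. The names `Tier`, `masked_by_default`, and `strictest` are hypothetical and are not part of Sidekick's interface. The numeric ordering (higher means more sensitive) is an assumption made for the example.

```python
from enum import IntEnum
from typing import Iterable


class Tier(IntEnum):
    """Hypothetical encoding of the four tiers; higher means more sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def masked_by_default(tier: Tier) -> bool:
    # Per the table, only Restricted fields are masked from users
    # who do not have explicit access.
    return tier is Tier.RESTRICTED


def strictest(tiers: Iterable[Tier]) -> Tier:
    # Illustrative helper (an assumption): the most sensitive tier
    # found among a group of fields, for example all columns in a table.
    return max(tiers)
```

Ordering the tiers makes it straightforward to summarise a whole table by its most sensitive field, although the source does not say whether Sidekick does this.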

How the classification works

The classification is performed by Sidekick's language models, running inside the client's environment. The approach is the same one a human reviewer would take: an LLM looks at the field name, the data type, the sample records, and the surrounding context, and assigns the most appropriate tier.
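
The inputs described above can be sketched as a small data structure plus a prompt builder. This is a minimal sketch, not Sidekick's implementation: `FieldContext`, `build_prompt`, and `classify` are hypothetical names, the prompt wording is invented for illustration, and the model is passed in as a plain callable so that no particular LLM API is assumed. Falling back to the strictest tier when the answer cannot be parsed is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

TIERS = ("Restricted", "Confidential", "Internal", "Public")


@dataclass(frozen=True)
class FieldContext:
    name: str               # field name, e.g. "email"
    data_type: str          # declared type, e.g. "varchar"
    samples: Sequence[str]  # a few sample values
    context: str            # surrounding context, e.g. table and sibling columns


def build_prompt(field: FieldContext) -> str:
    # Only a handful of samples are included to keep the prompt short.
    sample_text = ", ".join(field.samples[:5])
    return (
        f"Classify the field into one of: {', '.join(TIERS)}.\n"
        f"Field name: {field.name}\n"
        f"Data type: {field.data_type}\n"
        f"Sample values: {sample_text}\n"
        f"Context: {field.context}\n"
        "Answer with the tier name only."
    )


def classify(field: FieldContext, complete: Callable[[str], str]) -> str:
    # `complete` stands in for whatever model call is available.
    answer = complete(build_prompt(field)).strip().lower()
    for tier in TIERS:
        if answer == tier.lower():
            return tier
    # Assumption: an unrecognised answer is treated as the strictest tier.
    return "Restricted"
```

Injecting the model as a callable keeps the sketch testable with a stub and leaves the choice of model open.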

Automating classification in this way matters for two reasons. First, it scales. Classifying tens of thousands of fields by hand is not realistic for most organisations, which is why so many estates have inconsistent or missing classifications. Sidekick does this work continuously without manual intervention.

Second, it adapts. Because the classification re-runs as the estate changes (new tables, new fields, new connected sources), the picture stays current. A new field added to a CRM next month will be classified the moment Sidekick discovers it, not the next time someone runs a manual audit.
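
One way to picture the "classify on discovery" behaviour is a catalogue that is only extended when a scan returns a field it has not seen before. The sketch below is illustrative only: `sync_catalogue`, the `(source, table, field)` key, and the injected `classify` callable are assumptions, not Sidekick's actual data model.

```python
from typing import Callable, Dict, Iterable, List, Tuple

FieldKey = Tuple[str, str, str]  # (source, table, field), a hypothetical key


def sync_catalogue(
    catalogue: Dict[FieldKey, str],
    discovered: Iterable[FieldKey],
    classify: Callable[[FieldKey], str],
) -> List[FieldKey]:
    """Classify fields not yet in the catalogue and return the new keys."""
    added: List[FieldKey] = []
    for key in discovered:
        if key not in catalogue:
            # A newly discovered field is classified immediately.
            catalogue[key] = classify(key)
            added.append(key)
    return added
```

Running this after every scan means a field added next month is classified on the first scan that sees it. Reclassifying existing fields when their contents change is left out of the sketch.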

What happens to Restricted fields

Fields classified as Restricted are masked from users automatically. A user querying the data estate through Sidekick will not see the underlying values for these fields unless they have explicit access. The behaviour applies to natural language queries, to the dashboard, and to the data dictionary view.

This is what makes Sidekick safe to deploy across mixed-audience teams. An operations manager asking a question through Microsoft Teams can get a useful answer without ever being exposed to the individual personal records that underpin it. Compliance teams can review classifications without needing to see the data itself.

For fields where masking is too restrictive, Sidekick supports tokenisation and exclusion as alternatives, configurable per field.
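
The three per-field options (masking, tokenisation, exclusion) can be sketched as follows. This is a hedged illustration rather than Sidekick's mechanism: the function names, the `handling` mapping, the `"****"` placeholder, and the HMAC-based token format are all assumptions chosen for the example.

```python
import hashlib
import hmac
from typing import Any, Dict

MASK = "****"  # hypothetical placeholder shown instead of the real value


def tokenise(value: str, secret: bytes) -> str:
    # Assumed scheme: a keyed hash, so equal inputs yield equal tokens
    # and the original value cannot be read back from the token.
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def protect_row(
    row: Dict[str, Any],
    handling: Dict[str, str],
    secret: bytes,
    has_access: bool = False,
) -> Dict[str, Any]:
    """Apply per-field handling ("mask", "tokenise", "exclude") to one row."""
    if has_access:
        return dict(row)  # users with explicit access see the real values
    out: Dict[str, Any] = {}
    for field, value in row.items():
        mode = handling.get(field, "show")
        if mode == "exclude":
            continue  # the field is dropped from the result entirely
        if mode == "mask":
            out[field] = MASK
        elif mode == "tokenise":
            out[field] = tokenise(str(value), secret)
        else:
            out[field] = value
    return out
```

Tokenisation keeps values joinable and countable without revealing them, which is the usual reason to prefer it over masking for a given field.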

Why this matters for regulated environments

For organisations operating under GDPR, HIPAA, POPIA, or any equivalent regulation, classification is not optional. The challenge is usually not whether to classify, but how to make classification keep pace with a changing data estate.

Sidekick provides four structural advantages:

  • Coverage. Every field across every connected source is classified, not just the ones a team thought to review.
  • Currency. Classifications update continuously as the estate changes. There is no point-in-time gap.
  • Auditability. Every classification, every access attempt, every query is logged. The full trail can be exported into the client's SIEM for ongoing monitoring.
  • Sovereignty. The classification happens inside the client's tenant. The data never leaves the boundary in order to be classified.
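
To make the auditability point concrete, the sketch below serialises one audit event as a single JSON line, a shape many SIEM tools can ingest. The field names (`timestamp`, `kind`, `actor`, `target`, `detail`) and the function `audit_event` are hypothetical. The source does not describe Sidekick's actual log schema or export format.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def audit_event(
    kind: str,
    actor: str,
    target: str,
    detail: Dict[str, Any],
    when: Optional[datetime] = None,
) -> str:
    """Return one audit record as a JSON line (assumed schema)."""
    when = when or datetime.now(timezone.utc)
    record = {
        "timestamp": when.isoformat(),  # UTC, ISO 8601
        "kind": kind,                   # e.g. "classification", "access", "query"
        "actor": actor,
        "target": target,
        "detail": detail,
    }
    # sort_keys gives a stable field order, which helps when diffing logs.
    return json.dumps(record, sort_keys=True)
```

For example, `audit_event("classification", "sidekick", "crm.contacts.email", {"tier": "Restricted"})` yields one line that can be appended to a log file or forwarded to a collector.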

Taken together, these properties mean Sidekick produces an audit position most organisations cannot reach with manual processes alone.

Controlling what gets scanned

Sidekick does not assume access to everything by default. The client controls the scope of scanning through three mechanisms:

  • Allow-lists. Explicitly identify the schemas, tables, or sources Sidekick is permitted to scan.
  • Deny-lists. Exclude specific schemas, tables, or fields from scanning entirely.
  • Sampling limits. Configure how many records Sidekick samples per table for context (the default is the first and last 10,000 rows, but this is adjustable).
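
The three mechanisms above can be sketched in a few lines. The code is illustrative only: `in_scope` and `head_and_tail` are hypothetical names, the dotted `schema.table.field` paths and glob patterns are assumptions, and so is the rule that a deny-list match overrides an allow-list match. The sampler reads "first and last 10,000 rows" as 10,000 rows from each end, which is one possible interpretation of the default.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def in_scope(path: str, allow: Sequence[str], deny: Sequence[str]) -> bool:
    """True if a dotted path is allowed and not explicitly denied."""
    # Assumption: deny-list entries take precedence over allow-list entries.
    if any(fnmatchcase(path, pattern) for pattern in deny):
        return False
    return any(fnmatchcase(path, pattern) for pattern in allow)


def head_and_tail(rows: Sequence[T], limit: int = 10_000) -> List[T]:
    """Sample up to `limit` rows from the start and from the end of a table."""
    if limit <= 0:
        return []
    if len(rows) <= 2 * limit:
        return list(rows)  # small table: the two windows would overlap
    return list(rows[:limit]) + list(rows[-limit:])
```

With `allow=["crm.*"]` and `deny=["crm.contacts.notes"]`, the path `crm.contacts.email` is in scope while `crm.contacts.notes` is not.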

The result is that Sidekick's scope of access is defined and auditable from day one, and can be tightened or relaxed at any point during operation.

The takeaway

Sensitivity classification is one of the areas where Sidekick's design pays off most clearly. By making classification part of discovery rather than a follow-on workstream, by running the work inside the tenant rather than in a vendor cloud, and by keeping the picture current as the estate changes, it produces a level of governance visibility that most organisations cannot reach through manual processes.

For compliance and security leads, this is usually the part of a Sidekick conversation where the questions stop being defensive and start being operational.

Ready to get started?

See the classification model applied to your estate

The fastest way to understand how the four-tier model works in practice is to see it run against your own data. Talk to the Sidekick Lab team about a Proof of Value engagement and review the classifications produced from your first scan.