K-Anonymity is a foundational privacy principle that ensures each individual in a dataset cannot be distinguished from at least K - 1 others. It protects against identity disclosure by making individual records indistinguishable within a group.
This is particularly important when datasets contain quasi-identifiers: combinations of attributes such as age, gender, and zip code that can be used to re-identify individuals even after direct identifiers like names are removed.
K-Anonymity is widely used in data publishing and analysis to provide a measurable privacy guarantee. By ensuring that each combination of quasi-identifiers appears in at least K records, it limits the ability of attackers to isolate individuals.
K-Anonymity reduces privacy risks by generalizing and grouping records so that individuals cannot be uniquely identified. It is an essential step in preparing datasets for safe sharing or research use.
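As a rough illustration, here is a minimal sketch, assuming a pandas DataFrame and hypothetical column names, of how the k value of a dataset could be measured: it is the size of the smallest group of records that share the same quasi-identifier combination.

```python
import pandas as pd

def k_anonymity_level(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return k: the size of the smallest group of records that share
    the same combination of quasi-identifier values."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical example: after generalization, every (age, gender, zip)
# combination appears at least three times, so the data is 3-anonymous.
df = pd.DataFrame({
    "age":    ["20-29", "20-29", "20-29", ">40", ">40", ">40"],
    "gender": ["F", "F", "F", "M", "M", "M"],
    "zip":    ["476**", "476**", "476**", "479**", "479**", "479**"],
})
print(k_anonymity_level(df, ["age", "gender", "zip"]))  # -> 3
```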
Select a dataset and apply K-Anonymity to explore how it generalizes quasi-identifiers. Adjust the K value to see its impact on privacy and data utility.
L-Diversity extends K-Anonymity by requiring that each group of indistinguishable records contains at least L distinct values for the sensitive attribute.
Why does L-Diversity matter? It makes it much harder for anyone looking at the data to figure out personal details about someone: even if an attacker knows which group a person falls into, the variety of sensitive values within that group means they cannot be sure which value belongs to which person.
In short, L-Diversity keeps secrets safe by mixing different types of information in each group, so no one stands out too easily. It's like making sure that even if you peeked into a group's records, you would only see a mix of different clues, making it tough to single anyone out.
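A minimal sketch, again assuming pandas and hypothetical column names, of how the l value could be checked: for each group of records with identical quasi-identifiers, count the distinct sensitive values and take the smallest count.

```python
import pandas as pd

def l_diversity_level(df, quasi_identifiers, sensitive_attribute):
    """Return l: the smallest number of distinct sensitive values found in
    any equivalence class (a group with identical quasi-identifiers)."""
    return int(df.groupby(quasi_identifiers)[sensitive_attribute].nunique().min())

# Hypothetical example: the first group contains three different diseases,
# the second only two, so the dataset is 2-diverse overall.
df = pd.DataFrame({
    "age_range": ["20-29", "20-29", "20-29", ">40", ">40", ">40"],
    "disease":   ["Gastritis", "Flu", "Bronchitis", "Flu", "Flu", "Pneumonia"],
})
print(l_diversity_level(df, ["age_range"], "disease"))  # -> 2
```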
Select a dataset to perform L-Diversity:
Select a sensitive attribute:
Specify the L-value:
Before diving into T-Closeness privacy principles, it is highly recommended to first understand the foundations of K-Anonymity and L-Diversity, as T-Closeness builds upon these essential privacy frameworks.
T-Closeness further extends k-anonymity by requiring that the distribution of sensitive values within each equivalence class stays close to the distribution of those values in the original database.
Before discussing why T-Closeness privacy rules are necessary, let's first review the original dataset.
Name | Age | ZIP | Salary | Disease |
---|---|---|---|---|
Alice | 29 | 47677 | 3000 | Gastric ulcer |
Bob | 22 | 47602 | 4000 | Gastritis |
Charly | 27 | 47678 | 5000 | Stomach cancer |
Dave | 43 | 47905 | 6000 | Gastritis |
Eve | 52 | 47909 | 11000 | Flu |
Ferris | 47 | 47906 | 8000 | Bronchitis |
George | 30 | 47605 | 7000 | Bronchitis |
Harvey | 36 | 47673 | 9000 | Pneumonia |
Iris | 32 | 47607 | 10000 | Stomach cancer |
Here, each column header in this table is treated as an identifier.
Names are direct identifiers and not particularly useful for analysis, so we can simply remove the "Name" column. This step is not part of T-Closeness; it's just a straightforward removal.
Age | ZIP | Salary | Disease |
---|---|---|---|
29 | 47677 | 3000 | Gastric ulcer |
22 | 47602 | 4000 | Gastritis |
27 | 47678 | 5000 | Stomach cancer |
43 | 47905 | 6000 | Gastritis |
52 | 47909 | 11000 | Flu |
47 | 47906 | 8000 | Bronchitis |
30 | 47605 | 7000 | Bronchitis |
36 | 47673 | 9000 | Pneumonia |
32 | 47607 | 10000 | Stomach cancer |
We have two sensitive attributes in this dataset: Salary and Disease.
We anonymize the data by generalizing the quasi-identifiers: Age and ZIP.
Target: 3-anonymity, 3-diversity
Age | ZIP | Salary | Disease |
---|---|---|---|
20-29 | 476** | 3000 | Gastric ulcer |
20-29 | 476** | 4000 | Gastritis |
20-29 | 476** | 5000 | Stomach cancer |
>40 | 4790* | 6000 | Gastritis |
>40 | 4790* | 11000 | Flu |
>40 | 4790* | 8000 | Bronchitis |
30-39 | 476** | 7000 | Bronchitis |
30-39 | 476** | 9000 | Pneumonia |
30-39 | 476** | 10000 | Stomach cancer |
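A minimal sketch, assuming pandas, of how such a generalization could be produced in code. The bin edges are chosen to reproduce the age ranges above; for simplicity the ZIP rule keeps the first three digits for every record, whereas the table above masks a different number of digits per group.

```python
import pandas as pd

def generalize(df: pd.DataFrame) -> pd.DataFrame:
    """Generalize the quasi-identifiers Age and ZIP."""
    out = df.copy()
    # Bin ages into the ranges used in the example table.
    out["Age"] = pd.cut(df["Age"], bins=[19, 29, 39, 120],
                        labels=["20-29", "30-39", ">40"])
    # Keep the first three ZIP digits and mask the rest.
    out["ZIP"] = df["ZIP"].astype(str).str[:3] + "**"
    return out

original = pd.DataFrame({
    "Age": [29, 22, 27, 43, 52, 47, 30, 36, 32],
    "ZIP": [47677, 47602, 47678, 47905, 47909, 47906, 47605, 47673, 47607],
})
print(generalize(original))
```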
Suppose we know the following about a person in the dataset: they are in their twenties and live in a ZIP code starting with 476, so their record must be one of the first three rows.
After applying K-Anonymity and L-Diversity privacy rules, we can no longer determine which data point belongs to a specific person. However, we can still infer that the person has stomach issues. T-Closeness addresses this problem by ensuring that sensitive attribute distributions remain similar across equivalence classes, reducing the risk of inference attacks.
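T-Closeness is usually defined with the Earth Mover's Distance; here is a minimal sketch, assuming pandas and using the simpler total variation distance as a stand-in, of how the worst-case distance between an equivalence class and the overall dataset could be measured.

```python
import pandas as pd

def t_closeness_level(df, quasi_identifiers, sensitive_attribute):
    """Return the largest distance between the sensitive-value distribution of
    any equivalence class and the distribution over the whole table."""
    overall = df[sensitive_attribute].value_counts(normalize=True)
    worst = 0.0
    for _, group in df.groupby(quasi_identifiers):
        class_dist = group[sensitive_attribute].value_counts(normalize=True)
        diff = class_dist.reindex(overall.index, fill_value=0.0) - overall
        worst = max(worst, diff.abs().sum() / 2)  # total variation distance
    return worst

# Hypothetical use with the generalized table above:
# t_closeness_level(generalized_df, ["Age", "ZIP"], "Disease")
```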
Select a dataset to explore how different T values affect the overall data.
Differential Privacy is a technique used to protect individual privacy by adding random noise to data. This ensures that the privacy of individuals is preserved while still allowing useful insights to be drawn from the data. For numerical data, we use the Laplace Mechanism which adds random noise drawn from a Laplace distribution.
The Laplace Mechanism works by adding carefully calibrated random noise to each numerical value. The scale of that noise depends on two parameters: the sensitivity of the query and the privacy budget, epsilon.
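A minimal sketch, assuming numpy and illustrative epsilon and sensitivity values, of how the Laplace Mechanism could be applied to a numeric query result:

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Add Laplace noise with scale sensitivity / epsilon to a numeric result."""
    return true_value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical counting query such as "how many people have Age >= 40":
# sensitivity 1 (one person changes the count by at most 1), epsilon 0.5.
true_count = 3
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))
```

Smaller epsilon values give stronger privacy but noisier results.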
We will be using a preselected dataset and attribute, Age, where noise will be added to the Age values. An example selection query, "Age >= 40", is also provided to demonstrate how differential privacy works on an individual data point.
Additionally, you can reset everything in the module to its original state using the Reset button.
Sensitivity is set to 1 for selection queries, but you can adjust it here for demonstration purposes.
Original Result: N/A
Noisy Result: N/A
In this module, you will explore how differential privacy works with categorical data. We use the Exponential Mechanism, which assigns probabilities to each category based on how common they are in the original data, but adds randomness to protect privacy.
The Exponential Mechanism works by scoring each candidate category with a utility function (here, how often the category appears in the original data), assigning each category a probability proportional to exp(epsilon × utility / (2 × sensitivity)), and then sampling the released value from that distribution.
We will be using a preselected dataset, adult_data, and attribute, Relationship, where noise will be applied to the Relationship column using the Exponential Mechanism, allowing you to compare the original distribution with a privacy-preserving noisy distribution.
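A minimal sketch, assuming numpy and pandas, with an illustrative epsilon and a frequency-based utility score, of how the Exponential Mechanism could pick one category from a column:

```python
import numpy as np
import pandas as pd

def exponential_mechanism(values: pd.Series, epsilon: float) -> str:
    """Sample one category, favouring frequent ones.
    Utility of a category = its count; the count's sensitivity is 1."""
    counts = values.value_counts()
    scores = epsilon * counts.to_numpy() / 2.0
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    probs = weights / weights.sum()
    return np.random.choice(counts.index.to_numpy(), p=probs)

# Hypothetical example: privately release a "typical" Relationship value.
relationship = pd.Series(["Husband", "Husband", "Wife", "Own-child", "Husband"])
print(exponential_mechanism(relationship, epsilon=0.1))
```

To approximate the noisy distribution shown in the module, this sampling could be repeated many times and the results tallied.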
Additionally, you can reset everything in the module to its original state using the Reset button.
Explore how differential privacy affects various datasets and variables: