K-Anonymity is a foundational privacy principle that ensures each individual in a dataset cannot be distinguished from at least K - 1 others. It protects against identity disclosure by making individual records indistinguishable within a group.
This is particularly important when datasets contain quasi-identifiers: combinations of attributes such as age, gender, and zip code that can be used to re-identify individuals even after direct identifiers like names are removed.
K-Anonymity is widely used in data publishing and analysis to provide a measurable privacy guarantee. By ensuring that each combination of quasi-identifiers appears in at least K records, it limits the ability of attackers to isolate individuals.
K-Anonymity reduces privacy risks by generalizing and grouping records so that individuals cannot be uniquely identified. It is an essential step in preparing datasets for safe sharing or research use.
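As a rough illustration, here is a minimal sketch, assuming a pandas DataFrame and hypothetical column names, of how the k value of a dataset could be measured: it is the size of the smallest group of records that share the same quasi-identifier combination.

```python
import pandas as pd

def k_anonymity_level(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return k: the size of the smallest group of records that share
    the same combination of quasi-identifier values."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical example: after generalization, every (age, gender, zip)
# combination appears at least three times, so the data is 3-anonymous.
df = pd.DataFrame({
    "age":    ["20-29", "20-29", "20-29", ">40", ">40", ">40"],
    "gender": ["F", "F", "F", "M", "M", "M"],
    "zip":    ["476**", "476**", "476**", "479**", "479**", "479**"],
})
print(k_anonymity_level(df, ["age", "gender", "zip"]))  # -> 3
```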
Select a dataset and apply K-Anonymity to explore how it generalizes quasi-identifiers. Adjust the K value to see its impact on privacy and data utility.
L-Diversity extends K-Anonymity by requiring that each group of indistinguishable records contains at least L distinct values for the sensitive attribute.
Why does L-Diversity matter? It makes it much harder for anyone looking at the data to figure out personal details about someone: even if an attacker knows which group a person falls into, the variety of sensitive values within that group means they cannot be sure which value belongs to which person.
In short, L-Diversity keeps secrets safe by mixing different types of information in each group, so no one stands out too easily. It's like making sure that even if you peeked into a group's records, you would only see a mix of different clues, making it tough to single anyone out.
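A minimal sketch, again assuming pandas and hypothetical column names, of how the l value could be checked: for each group of records with identical quasi-identifiers, count the distinct sensitive values and take the smallest count.

```python
import pandas as pd

def l_diversity_level(df, quasi_identifiers, sensitive_attribute):
    """Return l: the smallest number of distinct sensitive values found in
    any equivalence class (a group with identical quasi-identifiers)."""
    return int(df.groupby(quasi_identifiers)[sensitive_attribute].nunique().min())

# Hypothetical example: the first group contains three different diseases,
# the second only two, so the dataset is 2-diverse overall.
df = pd.DataFrame({
    "age_range": ["20-29", "20-29", "20-29", ">40", ">40", ">40"],
    "disease":   ["Gastritis", "Flu", "Bronchitis", "Flu", "Flu", "Pneumonia"],
})
print(l_diversity_level(df, ["age_range"], "disease"))  # -> 2
```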
Select a dataset to perform L-Diversity:
Select a sensitive attribute:
Specify the L-value:
Before diving into T-Closeness privacy principles, it is highly recommended to first understand the foundations of K-Anonymity and L-Diversity, as T-Closeness builds upon these essential privacy frameworks.
T-Closeness further extends k-anonymity by requiring that the distribution of sensitive values within each equivalence class stays close to the distribution of those values in the original database.
Before discussing why T-Closeness privacy rules are necessary, let's first review the original dataset.
Name | Age | ZIP | Salary | Disease |
---|---|---|---|---|
Alice | 29 | 47677 | 3000 | Gastric ulcer |
Bob | 22 | 47602 | 4000 | Gastritis |
Charly | 27 | 47678 | 5000 | Stomach cancer |
Dave | 43 | 47905 | 6000 | Gastritis |
Eve | 52 | 47909 | 11000 | Flu |
Ferris | 47 | 47906 | 8000 | Bronchitis |
George | 30 | 47605 | 7000 | Bronchitis |
Harvey | 36 | 47673 | 9000 | Pneumonia |
Iris | 32 | 47607 | 10000 | Stomach cancer |
Here, each column header in this table is treated as an identifier.
Names are direct identifiers and not particularly useful for analysis, so we can simply remove the "Name" column. This step is not part of T-Closeness; it's just a straightforward removal.
Age | ZIP | Salary | Disease |
---|---|---|---|
29 | 47677 | 3000 | Gastric ulcer |
22 | 47602 | 4000 | Gastritis |
27 | 47678 | 5000 | Stomach cancer |
43 | 47905 | 6000 | Gastritis |
52 | 47909 | 11000 | Flu |
47 | 47906 | 8000 | Bronchitis |
30 | 47605 | 7000 | Bronchitis |
36 | 47673 | 9000 | Pneumonia |
32 | 47607 | 10000 | Stomach cancer |
We have two sensitive attributes in this dataset: Salary and Disease.
We anonymize the data by generalizing the quasi-identifiers: Age and ZIP.
Target: 3-anonymity, 3-diversity
Age | ZIP | Salary | Disease |
---|---|---|---|
20-29 | 476** | 3000 | Gastric ulcer |
20-29 | 476** | 4000 | Gastritis |
20-29 | 476** | 5000 | Stomach cancer |
>40 | 4790* | 6000 | Gastritis |
>40 | 4790* | 11000 | Flu |
>40 | 4790* | 8000 | Bronchitis |
30-39 | 476** | 7000 | Bronchitis |
30-39 | 476** | 9000 | Pneumonia |
30-39 | 476** | 10000 | Stomach cancer |
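A minimal sketch, assuming pandas, of how such a generalization could be produced in code. The bin edges are chosen to reproduce the age ranges above; for simplicity the ZIP rule keeps the first three digits for every record, whereas the table above masks a different number of digits per group.

```python
import pandas as pd

def generalize(df: pd.DataFrame) -> pd.DataFrame:
    """Generalize the quasi-identifiers Age and ZIP."""
    out = df.copy()
    # Bin ages into the ranges used in the example table.
    out["Age"] = pd.cut(df["Age"], bins=[19, 29, 39, 120],
                        labels=["20-29", "30-39", ">40"])
    # Keep the first three ZIP digits and mask the rest.
    out["ZIP"] = df["ZIP"].astype(str).str[:3] + "**"
    return out

original = pd.DataFrame({
    "Age": [29, 22, 27, 43, 52, 47, 30, 36, 32],
    "ZIP": [47677, 47602, 47678, 47905, 47909, 47906, 47605, 47673, 47607],
})
print(generalize(original))
```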
Suppose we know the following about a person in the dataset: they are in their twenties and live in a ZIP code starting with 476, so their record must be one of the first three rows.
After applying K-Anonymity and L-Diversity privacy rules, we can no longer determine which data point belongs to a specific person. However, we can still infer that the person has stomach issues. T-Closeness addresses this problem by ensuring that sensitive attribute distributions remain similar across equivalence classes, reducing the risk of inference attacks.
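T-Closeness is usually defined with the Earth Mover's Distance; here is a minimal sketch, assuming pandas and using the simpler total variation distance as a stand-in, of how the worst-case distance between an equivalence class and the overall dataset could be measured.

```python
import pandas as pd

def t_closeness_level(df, quasi_identifiers, sensitive_attribute):
    """Return the largest distance between the sensitive-value distribution of
    any equivalence class and the distribution over the whole table."""
    overall = df[sensitive_attribute].value_counts(normalize=True)
    worst = 0.0
    for _, group in df.groupby(quasi_identifiers):
        class_dist = group[sensitive_attribute].value_counts(normalize=True)
        diff = class_dist.reindex(overall.index, fill_value=0.0) - overall
        worst = max(worst, diff.abs().sum() / 2)  # total variation distance
    return worst

# Hypothetical use with the generalized table above:
# t_closeness_level(generalized_df, ["Age", "ZIP"], "Disease")
```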
Select a dataset to explore how different T values affect the overall data.
Differential Privacy is a technique used to protect individual privacy by adding random noise to data. This ensures that the privacy of individuals is preserved while still allowing useful insights to be drawn from the data. For numerical data, we use the Laplace Mechanism which adds random noise drawn from a Laplace distribution.
The Laplace Mechanism works by adding carefully calibrated random noise to each numerical value. The scale of that noise depends on two parameters: the sensitivity of the query and the privacy budget, epsilon.
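A minimal sketch, assuming numpy and illustrative epsilon and sensitivity values, of how the Laplace Mechanism could be applied to a numeric query result:

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Add Laplace noise with scale sensitivity / epsilon to a numeric result."""
    return true_value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical counting query such as "how many people have Age >= 40":
# sensitivity 1 (one person changes the count by at most 1), epsilon 0.5.
true_count = 3
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))
```

Smaller epsilon values give stronger privacy but noisier results.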
We will be using a preselected dataset and attribute, Age, where noise will be added to the Age values. An example selection query, "Age >= 40", is also provided to demonstrate how differential privacy works on an individual data point.
Additionally, you can reset everything in the module to its original state using the Reset button.
Sensitivity is set to 1 for selection queries, but you can adjust it here for demonstration purposes.
Original Result: N/A
Noisy Result: N/A
In this module, you will explore how differential privacy works with categorical data. We use the Exponential Mechanism, which assigns probabilities to each category based on how common they are in the original data, but adds randomness to protect privacy.
The Exponential Mechanism works by scoring each candidate category with a utility function (here, how often the category appears in the original data), assigning each category a probability proportional to exp(epsilon × utility / (2 × sensitivity)), and then sampling the released value from that distribution.
We will be using a preselected dataset, adult_data, and attribute, Relationship, where noise will be applied to the Relationship column using the Exponential Mechanism, allowing you to compare the original distribution with a privacy-preserving noisy distribution.
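A minimal sketch, assuming numpy and pandas, with an illustrative epsilon and a frequency-based utility score, of how the Exponential Mechanism could pick one category from a column:

```python
import numpy as np
import pandas as pd

def exponential_mechanism(values: pd.Series, epsilon: float) -> str:
    """Sample one category, favouring frequent ones.
    Utility of a category = its count; the count's sensitivity is 1."""
    counts = values.value_counts()
    scores = epsilon * counts.to_numpy() / 2.0
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    probs = weights / weights.sum()
    return np.random.choice(counts.index.to_numpy(), p=probs)

# Hypothetical example: privately release a "typical" Relationship value.
relationship = pd.Series(["Husband", "Husband", "Wife", "Own-child", "Husband"])
print(exponential_mechanism(relationship, epsilon=0.1))
```

To approximate the noisy distribution shown in the module, this sampling could be repeated many times and the results tallied.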
Additionally, you can reset everything in the module to its original state using the Reset button.
Explore how differential privacy affects various datasets and variables: