Estimating the level of anonymity of databases

October 21, 2024

Dr. Zoltán Alexin gave a lecture on the meaning of “anonymous” data on October 2 at the 8th Privacy Days Praha conference in Prague, organized by the European Federation of Data Protection Officers.

The lecture discussed the risks of releasing medical datasets, emphasizing that they often consist of pseudonymous rather than anonymous data, which means that the individuals in them can be re-identified using additional data. The dark web is full of stolen databases suitable for such attacks. Therefore, custodians must assess datasets carefully before handing them over to researchers, because the included quasi-identifiers (information crumbs), such as date of birth, ZIP code, or disease codes, can in combination uniquely identify a living individual.

The proposed statistical method estimates the resilience of a dataset against re-identification by calculating the overall entropy of its quasi-identifiers. Entropy here is the average number of bits (amount of information) known about an individual in the dataset. Higher entropy indicates greater vulnerability to re-identification.
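
The summary above does not give the exact formula used in the lecture. As a minimal sketch, assuming that "overall entropy" means the Shannon entropy of the empirical joint distribution of the quasi-identifier values, the calculation could look like the following. The function name, column names, and sample records are hypothetical and serve only as illustration.

```python
from collections import Counter
from math import log2


def quasi_identifier_entropy(records, quasi_identifiers):
    """Shannon entropy, in bits, of the joint distribution of the given columns.

    Assumption: this is one plausible reading of "overall entropy";
    the lecture's exact definition may differ.
    """
    # Group records by their combination of quasi-identifier values.
    counts = Counter(tuple(r[c] for c in quasi_identifiers) for r in records)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    # H = -sum(p * log2(p)) over all observed combinations.
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Hypothetical toy data: two people share a birth date and ZIP code.
records = [
    {"birth_date": "1980-01-01", "zip": "6720"},
    {"birth_date": "1980-01-01", "zip": "6720"},
    {"birth_date": "1975-05-17", "zip": "1011"},
    {"birth_date": "1990-12-31", "zip": "6720"},
]

entropy_bits = quasi_identifier_entropy(records, ["birth_date", "zip"])
max_bits = log2(len(records))  # reached only when every record is unique
print(f"{entropy_bits:.2f} of at most {max_bits:.2f} bits")
```

Under this reading, the entropy reaches its maximum, log2 of the number of records, exactly when every record has a unique combination of quasi-identifiers, which is the most vulnerable case.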

The lecture gave an example of computing the entropy of a database that contains the date of birth and the ZIP code of the residential address of Hungarian citizens. A more illustrative statistic is the average k: the average number of individuals in the dataset who share the same combination of quasi-identifier values.
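
The average k can be sketched in a few lines as well. The version below assumes that the average is taken over the groups of records sharing identical quasi-identifier values, that is, the number of records divided by the number of distinct combinations. The lecture may instead have averaged over individuals, which would give a different figure. All names and sample records are hypothetical.

```python
from collections import Counter


def average_k(records, quasi_identifiers):
    """Average size of the groups that share the same quasi-identifier values.

    Assumption: the average is taken per group (records / distinct
    combinations), not per individual.
    """
    counts = Counter(tuple(r[c] for c in quasi_identifiers) for r in records)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


# Hypothetical toy data: four records, three distinct combinations.
sample = [
    {"birth_date": "1980-01-01", "zip": "6720"},
    {"birth_date": "1980-01-01", "zip": "6720"},
    {"birth_date": "1975-05-17", "zip": "1011"},
    {"birth_date": "1990-12-31", "zip": "6720"},
]

avg_k = average_k(sample, ["birth_date", "zip"])
print(f"average k = {avg_k:.2f}")
```

An average k close to 1 means that most combinations of quasi-identifiers belong to a single person, so most individuals in the dataset would be unique and therefore exposed to re-identification.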

Page last modified: October 21, 2024