2 Anonymization

Researchers need to ensure that the privacy of human participants is properly protected in line with national and/or international law. One way to achieve this goal is to anonymize[1] the data, rendering identification of participants nearly impossible. There are two ways in which participants can be identified: (1) through direct identifiers, such as names, addresses or photos, and (2) through combinations of indirect identifiers (e.g., date of birth + job title + name of employer). Below we detail ways of minimizing these risks, although the risk of re-identification can often not be eliminated completely. Researchers must weigh risks and benefits, bearing in mind that the research participants also have a legitimate interest in the realisation of benefits due to their participation.

First, researchers are advised to consider the legal standards that apply to them in particular. The United States Department of Health and Human Services has developed a “de-identification standard” (http://bit.ly/2Dxkvfo) to comply with the HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule. Readers may also refer to the guide to de-identification (http://bit.ly/2IxEo9Q) developed by the Australian National Data Service and the accompanying decision tree (http://bit.ly/2FJob3i). Finally, a subsection below deals with new EU data protection laws.

In general, since a relatively limited set of basic demographic information may suffice to identify individual persons (Sweeney, 2000), researchers should try to limit the number of recorded identifiers as much as possible. If the collection of direct or many indirect identifiers is necessary, researchers should consider whether these need to be shared. If directly identifying variables are only recorded for practical or logistic purposes, e.g., to contact participants over the course of a longitudinal study, the identifying variables should simply be deleted from the publicly shared dataset, in which case the dataset will be anonymized, provided that the remaining variables do not permit identification through indirect identifiers.

A special case of this situation is the use of participant ID codes to refer to individual participants in an anonymous manner. ID codes should be completely distinct from real names (e.g., do not use initials). Participant codes should also never be based on indirectly identifying information, such as date of birth or postal codes. These ID codes can be matched with identifying information that is stored in a separate and secure, non-shared location.
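
As a minimal sketch of this practice, the following Python example draws random participant codes and splits a set of records into a shareable table and a private key table. The function name `assign_participant_codes`, the field names, and the `P` prefix are hypothetical choices made for illustration; the only library call used is `secrets.token_hex` from the Python standard library.

```python
import secrets


def assign_participant_codes(records, direct_identifiers, code_bytes=4):
    """Split records into a shareable table and a private key table.

    Codes are drawn at random, so they carry no information about names,
    dates of birth, or postal codes. All names here are illustrative.
    """
    shared_rows, key_table = [], []
    used_codes = set()
    for record in records:
        # Draw a random code and retry on the (rare) collision.
        code = "P" + secrets.token_hex(code_bytes)
        while code in used_codes:
            code = "P" + secrets.token_hex(code_bytes)
        used_codes.add(code)

        # Private side: the code plus the direct identifiers.
        key_row = {"participant_id": code}
        key_row.update({k: record[k] for k in direct_identifiers if k in record})
        key_table.append(key_row)

        # Shared side: the code plus every non-identifying variable.
        shared_row = {"participant_id": code}
        shared_row.update(
            {k: v for k, v in record.items() if k not in direct_identifiers}
        )
        shared_rows.append(shared_row)
    return shared_rows, key_table


# Made-up example records.
records = [
    {"name": "Alice Example", "email": "alice@example.org", "score": 12},
    {"name": "Bob Example", "email": "bob@example.org", "score": 17},
]
shared, key = assign_participant_codes(records, ("name", "email"))
```

Only `shared` would be deposited with the data; `key` would be stored in the separate, secure location described above. Note that this sketch removes direct identifiers only and does nothing about indirect identifiers that remain in the shared table.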

If indirect identifiers are an important part of the dataset, researchers should carefully consider the risks of re-identification. For some variables it may be advisable or even required to restrict or transform the data. For example, for income information, a simple step is to restrict the upper and lower range (using top- and/or bottom-coding). Similarly, location information such as US zip codes may need to be aggregated to provide greater protection (especially in low-population areas, where a city or US zip code might be identifying information in conjunction with a variable like age).

To analyze these risks more generally for a dataset, it may be useful to consider the degree to which each participant is unique in the dataset and in the reference population against which it may be compared. The nature of the reference population is usually described by the sampling procedure. For instance, the reference population may consist of students at the university where the research was conducted, or of patients at a hospital clinic where a study was performed, or of the adult population of the town where the research was done.

Another potentially useful method is to consider threat models, i.e., how re-identification could be performed by different actors with different motives. Such a thought exercise can help uncover weaknesses in data protection. For example, one threat model is that the participant tries to re-identify themselves. In this case, one needs to consider what potentially identifying variables the participant has access to, and what harm may result from successful re-identification in view of what the participant already knows about themselves. Another threat model could be that a third party tries to identify a specific participant based on publicly available information. In this case, it is necessary to consider what publicly available information, if any, would permit re-identification by matching to the original dataset. Such threat assessments have the purpose of determining the risk of (re-)identification and should be used by researchers (ideally with the help of data archiving specialists from libraries, institutional or public repositories) to choose appropriate technical and/or organizational measures to protect participants’ privacy (e.g., by removing or aggregating data or restricting access).
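
The transformations and the uniqueness check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended procedure: the income bounds, the three-digit zip prefix, the ten-year age bands, and all function names are hypothetical, and the check only measures uniqueness within the sample, not within the reference population.

```python
from collections import Counter


def top_bottom_code(value, lower, upper):
    """Clamp a numeric value into [lower, upper] (bottom- and top-coding)."""
    return max(lower, min(upper, value))


def coarsen_zip(zip_code, keep_digits=3):
    """Keep the leading digits of a zip code and mask the rest."""
    return zip_code[:keep_digits] + "*" * (len(zip_code) - keep_digits)


def age_band(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    start = (age // width) * width
    return f"{start}-{start + width - 1}"


def unique_rows(rows, indirect_identifiers):
    """Return rows whose combination of indirect identifiers occurs only once."""
    def combo(row):
        return tuple(row[q] for q in indirect_identifiers)

    sizes = Counter(combo(row) for row in rows)
    return [row for row in rows if sizes[combo(row)] == 1]


# Made-up example data; bounds and band widths are illustrative only.
rows = [
    {"age": 34, "zip": "02139", "income": 52_000},
    {"age": 36, "zip": "02134", "income": 61_000},
    {"age": 71, "zip": "59301", "income": 410_000},
]
indirect = ["age", "zip"]

# Before transformation, every participant is unique on (age, zip).
raw_unique = unique_rows(rows, indirect)

transformed = [
    {
        "age": age_band(r["age"]),
        "zip": coarsen_zip(r["zip"]),
        "income": top_bottom_code(r["income"], 20_000, 150_000),
    }
    for r in rows
]

# After transformation, the first two participants share a combination,
# but the third is still unique in the sample.
still_unique = unique_rows(transformed, indirect)
```

In this toy dataset, aggregation makes the first two records indistinguishable on age and zip code, while the third remains unique. Such a record is the kind of case where further aggregation, removal, or restricted access might be considered.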

Finally, if anonymization is impossible, researchers can obtain informed consent for using and sharing non-anonymized data (see below for example templates for consent) or place strict controls on access to the data.

2.1 EU Data Protection Guidelines

Many researchers will be required to follow new EU data protection guidelines. The European Parliament, the Council of the European Union, and the European Commission have implemented the General Data Protection Regulation (GDPR) (Regulation (EU) 2016/679), a regulation that aims at strengthening and unifying data protection for all individuals within the European Union (EU). It is effective as of May 25, 2018. This new regulation makes a distinction between pseudonymisation and anonymisation. Pseudonymisation refers to the processing of personal data in such a way that it can no longer be associated with a specific data subject unless additional information is provided. It typically involves replacing identifying information with codes[2]. The key that links the codes to the identifying information must then be kept separately. The GDPR promotes the use of pseudonymisation as a standard data protection practice for scientific research purposes. Anonymous data are defined as information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable by any means. This regulation does not concern the processing of such anonymous information, including for statistical or research purposes. More information on this regulation can be found on the European Commission’s website (http://bit.ly/2rnv0RA). Chassang (2017) also discusses its implications for scientific research in more detail.

The EU-funded project OpenAIRE (https://www.openaire.eu/) offers the free-to-use data anonymization tool Amnesia “that allows to remove identifying information from data” and “not only removes direct identifiers like names, SSNs etc but also transforms secondary identifiers like birth date and zip code so that individuals cannot be identified in the data” (https://amnesia.openaire.eu/index.html).


  1. The terms “anonymize” and “de-identify” are used differently in various privacy laws, but typically refer to the process of minimizing risk of re-identification using current best statistical practices (El Emam, 2013).

  2. It should therefore be distinguished from a practice sometimes called “pseudo-anonymization”, which involves only partial anonymization (e.g., “Michael Johnson” becoming “Michael J.”).