Privacy protection with AI: Survey of data-anonymization techniques
Abstract—Text anonymization is becoming a more important part of data custodianship due, in part, to strengthened privacy legislation. Many open-source libraries address text anonymization via Natural Language Processing (NLP) techniques. Named Entity Recognition (NER), a sub-task of NLP, can be used to recognize Personally Identifiable Information (PII) in unstructured text, such as names, locations and organizations. Since a poorly performing text-anonymization system has the potential to cause significant harm, it is important to understand the strengths and weaknesses of various approaches. A survey of four different NER libraries is conducted to become familiar with the performance of pre-trained models and the process of fine-tuning those models. Using precision, recall and F1 scores, each model's performance is compared against the others. Word embeddings and stacked word-embedding techniques demonstrate the best performance and also take the longest to train. Conditional Random Fields (CRF), an approach to information extraction with a long history, performs well and requires relatively little time to train. SpaCy demonstrates improvements with re-training. At the time of writing, the Presidio API has no support for the ORG entity. A general trend across most models is a low GPE precision score. Recall is the most important measurement to consider in a privacy-protection context.
Keywords—Named Entity Recognition, Natural Language Processing, anonymization, privacy, Artificial Intelligence
A clear definition of privacy is important for determining what needs protecting. The debate around privacy can be understood as a question of access to information or control of information, which has implications for the types of solutions designed to protect it. For instance, data-anonymization tools and techniques are more relevant from the access-to-information side of the privacy debate, though it would be wrong to assume that control of information doesn't also pose risks. In acknowledging the Snowden revelations, Macnish demonstrates how, in the context of government surveillance of internet communications, competing definitions of control and access are both potentially problematic in their ability to cause harm. The application of data-anonymization techniques to achieve a degree of privacy protection assumes organizations in control of data are already in a position of risk, mitigated by compliance with relevant privacy laws and regulations. As was demonstrated with the Facebook-Cambridge Analytica legal proceedings in 2018, private organizations do not always comply, or do not adequately enforce policy intended to protect access to and control of data. Despite best efforts, regulations do not always have the intended effect. Hassan et al. define privacy in the context of self-determination, which implies a notion of individuals' control over decisions to seclude themselves, informationally or otherwise. The research in this paper is directed towards solutions for data privacy that pertain to accessing information, while acknowledging that control of information is another area of focus with significant privacy risk.
Related to informational privacy is the notion of personal data. Article 4 of the General Data Protection Regulation (GDPR) defines personal data as “…any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;”. The GDPR expressly states that stored data containing personal information undergo an anonymization or pseudonymization process. It further defines pseudonymization as “…processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person;”. Given that there are legal obligations for anonymization, and substantial fines for not complying, techniques for text anonymization have become increasingly relevant.
Natural Language Processing
Natural Language Processing (NLP) is an area of study that looks to unstructured text as a data source. Named Entity Recognition (NER) is a sub-task of NLP and is sometimes treated as a sequence labeling problem, meaning the sequence of text in a sentence is labelled with linguistic tags. NER includes sub-processes such as tokenization, Parts of Speech (PoS) tagging and parsing. Manually tagging data is the process of labelling unstructured text, also referred to as annotation. In a data-anonymization context, words would be labelled according to the types of Personally Identifiable Information (PII) entities they represent, such as Name, Location or Date of Birth. In cases where pre-trained models may not sufficiently generalize to documents in other, more specific domains, manually tagging text may be necessary to fine-tune models in order to obtain a sufficient level of accuracy. In the domain of language learning, for instance, the likelihood of encountering spelling or grammatical errors and poorly formed sentences in corpora is greater, making pre-trained NLP models less reliable. Megyesi et al. describe a manual annotation process in building their pseudonymization tool specific to the domain of people learning new languages.
Word embeddings are learned, vector representations of words derived from large bodies of text and are used for measuring semantic and syntactic similarities between words. Word embeddings are trained through a machine learning process, separate from the training process that makes use of those embeddings. In 2013, Mikolov et al. developed an architectural approach to generating word vectors called Word2Vec. Publicly available, pre-trained Word2Vec models were trained on the Google News corpus containing over 100 billion words using Continuous Bag-of-Words (CBOW) and Skip-gram models. Training data, training algorithm and size of the vector are factors that affect the quality of word vectors. The available Word2Vec vector file represents each word in a 3-million-word vocabulary as a 300-dimensional vector. Once a vector file is produced, similarities between words can be exploited for a number of NLP tasks. Learned relationships can be inferred by comparing the closeness of words across multiple dimensions. For example, Word2Vec would make visible the semantic relationship between the word pair ‘king’ and ‘queen’ as similar to the relationship between the words ‘man’ and ‘woman’. Syntactic relationships, such as ‘live’ and ‘lived’ being the same as ‘run’ and ‘ran’, are also made visible through numeric representation. The significance for data-anonymization techniques that rely on NER is that when semantic relationships between similar words are preserved, downstream NLP tasks such as NER produce fewer errors.
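The way such similarities are exploited can be sketched with cosine similarity over word vectors. The vectors below are made-up, three-dimensional stand-ins for the 300-dimensional Word2Vec vectors described above, chosen only to illustrate the classic king/queen analogy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors (illustrative only; real Word2Vec
# vectors are 300-dimensional and learned from a large corpus).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.0, 0.9]),
}

# The classic analogy: king - man + woman should land nearest queen.
analogy = vectors["king"] - vectors["man"] + vectors["woman"]
print(cosine_similarity(analogy, vectors["queen"]))
```

With real embeddings the same nearest-neighbour query is what surfaces semantic and syntactic regularities for downstream tasks such as NER.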
Contextualized Word Embeddings
Recently, contextual approaches to word embeddings have improved results for tasks like NER. Unlike static embeddings such as GloVe, where each word has only a single vector, contextual embeddings produce multiple, different vectors depending on the surrounding text: the context of a word in a sentence, or of a character in a word. Depending on how a language model is built, the sequence can be directional or bidirectional. ELMo and BERT, which stands for Bidirectional Encoder Representations from Transformers, are examples of approaches to contextual embeddings. Capturing underlying relationships in a language model improves data-anonymization techniques that rely on tasks such as NER.
Measures of Privacy Protection in Structured Data
Privacy models such as k-anonymity, l-diversity and t-closeness can be used to measure the risk of identity disclosure. Critiques of the effectiveness of these models and their enhancements are described as tradeoffs between preserving privacy and preserving the utility of data. As privacy-preserving data-transformation techniques, these models can be used to assess the likelihood of re-identification, commonly referred to as Statistical Disclosure Control (SDC). Given that these privacy models operate on structured data, their relevance to this research on unstructured text is limited. Though efforts have been made to use k-anonymity to preserve privacy in an unstructured context, doing so requires converting unstructured data to structured data and classifying it as sensitive, quasi-sensitive and personal data.
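As a sketch of what these models measure on structured data, the following computes the k in k-anonymity as the size of the smallest group of records sharing the same quasi-identifier values. The records and attribute names are hypothetical, already-generalized values:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier
    attributes; the dataset is k-anonymous for this value of k."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(groups.values())

# Hypothetical generalized records: ages binned, postal codes truncated.
records = [
    {"age": "30-40", "zip": "V8*", "diagnosis": "flu"},
    {"age": "30-40", "zip": "V8*", "diagnosis": "asthma"},
    {"age": "40-50", "zip": "V9*", "diagnosis": "flu"},
    {"age": "40-50", "zip": "V9*", "diagnosis": "diabetes"},
]
print(k_anonymity(records, ["age", "zip"]))  # each class holds 2 records, so k = 2
```

The privacy/utility tradeoff is visible here: coarser generalization of the quasi-identifiers raises k but discards detail.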
General Data Protection Regulation
In response to an awareness of a power imbalance created when an increasing number of organizations maintain both control of and access to large swaths of personal data, regulations seek to impose limitations on those powers. California’s Consumer Privacy Act (CCPA), the Health Insurance Portability and Accountability Act (HIPAA) and the Children’s Online Privacy Protection Rule (COPPA) are examples of American legal frameworks that define data protection in specific contexts of informational privacy. Though the General Data Protection Regulation (GDPR) originated in Europe, it is not bound by citizenship or membership to the European Union and has implications beyond Europe’s physical boundaries. The Freedom of Information and Privacy Protection Act (FOIPPA) in British Columbia, Canada, is specific to the provincial government’s responsibilities around the handling of PII, including specific data protection behaviours. The regulatory landscape that some organizations find themselves in requires ongoing compliance or, failing that, exposes them to significant fines.
Electronic Health Records (EHR)
The trend for medical records is that they are increasingly electronic. Medical research that stands to benefit from EHR must comply with regulations, such as HIPAA, which requires de-identification by removing Personal Health Information. Depending on the country of origin, other regulatory bodies such as the Commission Nationale de l’Informatique et des Libertés (CNIL) in France also have requirements to anonymize health data for research purposes.
From a threat modelling perspective, PII in information systems is an asset to be identified and protected. Security audits identify targets for potential attacks and report back on recommended security measures. PII, including Social Insurance Numbers, Provincial Health Numbers, Names and Dates of Birth, is a target for unauthorized access. Roughly a third of insider threat cases at US financial institutions involved the targeting of PII to be used for fraud. Even more concerning, very few data points are required to identify most people with a degree of confidence, given seemingly sparse data such as a person’s date of birth, postal code and gender. Applying NER to mask, suppress, distort, swap, or generalize PII can mitigate the impact of potential security threats, and reduce the organizational risk that comes with being a data custodian of PII.
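A minimal sketch of the masking step, assuming an upstream NER model has already produced (start, end, label) spans; the sample text and spans here are invented:

```python
def mask_entities(text, entities, mask="<{label}>"):
    """Replace each detected entity span with a placeholder label.
    `entities` is a list of (start, end, label) spans, e.g. as
    produced by an NER model."""
    out, last = [], 0
    for start, end, label in sorted(entities):
        out.append(text[last:start])       # untouched text before the span
        out.append(mask.format(label=label))
        last = end
    out.append(text[last:])                # trailing text after the last span
    return "".join(out)

text = "Jane Doe was born on 1990-01-01 in Victoria."
spans = [(0, 8, "PERSON"), (21, 31, "DATE"), (35, 43, "GPE")]
print(mask_entities(text, spans))  # <PERSON> was born on <DATE> in <GPE>.
```

Suppression, distortion or swapping differ only in what is substituted for each span; the span-detection step, and therefore recall, is the same bottleneck for all of them.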
The purpose of the research is to survey existing named entity recognition models in order to better understand their strengths and weaknesses. The research focuses on gaining insight into what reasons are behind the differences in performance for NER models.
The Conditional Random Fields (CRF) algorithm is used to build a model using Python’s implementation in sklearn-crfsuite with Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimization. CRF is a popular, discriminative, undirected graphical model appropriate for sequence analysis. This means that CRF can be used to build a character- or word-based language model which, given a word or a character, can predict the next or previous word based on the distributed probability of previous and next words learned during the process of training a model. Song et al. acknowledge that parameter estimation plays a significant role in the quality of the model. The authors state that the weights of these parameters can be obtained efficiently using the L-BFGS method.
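sklearn-crfsuite expects each token to be represented as a dictionary of features drawn from the token and its neighbours. The extractor below is an illustrative sketch; the specific feature set is an assumption for demonstration, not the configuration used in this survey:

```python
def word2features(sentence, i):
    """Feature dictionary for token i of a tokenized sentence, in the
    dict-per-token format consumed by sklearn-crfsuite."""
    word = sentence[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features

sentence = ["Alice", "works", "at", "Acme"]
X = [word2features(sentence, i) for i in range(len(sentence))]
# These feature dicts would then be fit against BIO/BILUO labels with
# sklearn_crfsuite.CRF(algorithm="lbfgs").
print(X[0])
```

Because the features are hand-crafted and sparse, training is far cheaper than learning dense embeddings, which is consistent with the short training times reported for CRF below.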
A SpaCy model called `en_core_web_lg` is a pre-trained statistical model for English that offers text-specific token vectors, Parts of Speech (PoS) tags, and named entities; it is trained on OntoNotes, a large corpus of annotated text. The model is trained with a transition-based approach similar to that outlined by Lample et al. However, while Lample et al. suggest using a bidirectional Long Short-Term Memory (LSTM), SpaCy instead uses convolutional layers with residual connections, layer normalization and maxout non-linearity. Another important aspect of the SpaCy model is the use of bloom embeddings. Bloom embeddings take advantage of a hashing trick to compress a large but sparse vector into a dense embedding which contains the same information in compressed form. In the case of SpaCy, the bloom embeddings take into consideration the features of neighbouring text, giving each context of a word a unique embedding.
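The hashing trick behind bloom embeddings can be sketched in a few lines: each word is mapped to several rows of a small table by independent hashes, and those rows are summed into one dense vector. Collisions in any single row are tolerated because a word is identified by its combination of rows. This is a simplification of SpaCy's actual implementation, which also hashes prefix, suffix and shape features:

```python
import hashlib
import numpy as np

def bloom_embed(word, table, num_hashes=4):
    """Dense vector for `word` via the hashing trick: sum `num_hashes`
    rows of a small embedding table, indexed by independent hashes."""
    rows, _dim = table.shape
    indices = [
        int(hashlib.md5(f"{seed}:{word}".encode()).hexdigest(), 16) % rows
        for seed in range(num_hashes)
    ]
    return table[indices].sum(axis=0)

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 64))  # far smaller than a full vocabulary
v1 = bloom_embed("anonymization", table)
v2 = bloom_embed("anonymization", table)  # same word, same vector
print(v1.shape)  # (64,)
```

The compression comes from the table having far fewer rows than the vocabulary has words, while lookups remain deterministic.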
Flair is an open-source framework that comes with various libraries, models and interfaces that enable the application of NLP tasks. Word embeddings convert words into multi-dimensional vectors, and Flair allows multiple embeddings to be stacked, creating hybrid embeddings. For example, a 300-dimensional Word2Vec embedding stacked on top of a 200-dimensional one-hot embedding becomes a 500-dimensional hybrid embedding. Stacked embeddings have been shown to improve on previous methods of NER. We have made use of four different embeddings based on those imported in Microsoft’s Presidio Research project: GloVe, BERT, forward-trained news embeddings, and backward-trained news embeddings.
The backward and forward news embeddings are both created from the same 1-billion-word corpus. The difference is that the backward language model processes words from the end of the sequence to the start and predicts the preceding word, rather than the next word as the forward model does. Flair recommends combining the forward and backward versions when possible.
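Stacking, in this sense, is vector concatenation. The sketch below reproduces the 300 + 200 = 500 dimensional example from above, with zero-filled arrays standing in for real embedding lookups:

```python
import numpy as np

def stack_embeddings(*vectors):
    """Stacking in the Flair sense is concatenation: a 300-d vector
    stacked with a 200-d vector yields a 500-d hybrid vector."""
    return np.concatenate(vectors)

word2vec_like = np.zeros(300)  # stand-in for a Word2Vec lookup
one_hot_like = np.zeros(200)   # stand-in for a one-hot embedding
hybrid = stack_embeddings(word2vec_like, one_hot_like)
print(hybrid.shape)  # (500,)
```

The cost of stacking is also visible here: every added embedding widens the input to the sequence tagger, which contributes to the long training times reported for the Flair models.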
Presidio is offered both as an open-source, self-hosted service and as Software-as-a-Service (SaaS) hosted by Microsoft. The NER implementation offers identification of PII and multiple anonymization techniques via a RESTful API. Presidio is written in Go and Python and leverages Kubernetes for container orchestration, SpaCy for NER and regex for predefined recognizers. Predefined recognizers include phone number, email, credit card number, and other types of PII that can be defined by a predictable pattern.
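Pattern-based recognition of this kind can be illustrated with plain regular expressions. The patterns below are deliberately simplified stand-ins, not Presidio's own recognizers:

```python
import re

# Simplified pattern-based recognizers in the spirit of Presidio's
# predefined recognizers (illustrative patterns, not production-grade).
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]){3}\d{4}\b"),
}

def recognize(text):
    """Return (label, matched_text) pairs for every pattern hit."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits.extend((label, m.group()) for m in pattern.finditer(text))
    return hits

text = "Call 555-867-5309 or email jane@example.com."
print(recognize(text))
```

As the Ferrández et al. study discussed later observes, such syntactically predictable PII suits regex well, whereas entities like organization names do not, which is consistent with regex being reserved for the predictable recognizers here.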
The method for evaluating the performance of the models is to produce a confusion matrix for each model, reporting precision, recall and F1 score for the entities ORG (Organization), GPE (Geo-Political Entity), PERSON (People) and PII (Personally Identifiable Information).
One way to look at precision is as a measure of false positives: of all the entities the model predicts, how many are correct. For example, if there are 10 names (PII) in a body of text and the model returns 12 names, it has misidentified 2 words, so precision is 10 / (10 + 2) = 0.83.
P = True Positive / (True Positive + False Positive) (1)
Conversely, recall measures false negatives. If that same model finds only 8 of the 10 true name entities, it has missed 2. Missing name entities is more problematic from a privacy perspective, since it means that PII will not be identified and subsequently anonymized. The equation for recall in this case is 8 / (8 + 2) = 0.80.
R = True Positive / (True Positive + False Negative) (2)
The F1 score is a balanced measure which takes into consideration the performance of both precision and recall. It is calculated as the product of precision and recall divided by their sum, multiplied by 2. Given a precision of 0.83 and a recall of 0.80, F1 = 2 * ((0.83 * 0.80) / (0.83 + 0.80)) = 0.8147.
F1 = 2PR / (P+R) (3)
While F1 is the most common F score, the subscript can be adjusted to change the weighting. An F2 score, for example, is the same as an F1 score with recall weighted to be twice as important as precision. Alternatively, precision can be given twice as much weight by calculating the F0.5 score instead.
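The three metrics and the Fβ generalization follow directly from equations (1)–(3); the numbers below reuse the worked example above (12 predictions with 10 correct, and 8 of 10 true names found):

```python
def precision(tp, fp):
    """Equation (1): TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (2): TP / (TP + FN)."""
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """General F-score; beta > 1 weighs recall more heavily.
    beta=1 reduces to equation (3), the harmonic mean of P and R."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p = precision(10, 2)  # 0.833...
r = recall(8, 2)      # 0.80
print(round(f_beta(p, r), 4))     # F1
print(round(f_beta(p, r, 2), 4))  # F2: recall weighted twice as heavily
```

Note that with unrounded precision (10/12) the F1 comes out slightly higher than the 0.8147 obtained from the rounded 0.83 in the text; the F2 score sits closer to the lower recall value, which is why it is preferred in de-identification studies.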
Table 1: Sample Confusion Matrix
Because of the differences between models and the variance in entity type support, each of the labels used in the evaluation is an abstraction for more than one entity type. In order to account for these differences and present comparable results, an entity dictionary is used to map one entity to another, for instance mapping PER to PERSON, or COUNTRY to GPE. A full listing of this mapping scheme is presented in Table 2.
Table 2: Mapping Scheme for Entity Labels

|Label|Mapped Entity Types|
|---|---|
|ORG|ORG, ORGANIZATION|
|GPE|COUNTRY, CITY, LOCATION, NATION_MAN, NATION_WOMAN, NATION_PLURAL, NATIONALITY, GPE|
|PERSON|FIRST_NAME, LAST_NAME, PERSON|
|PII|FIRST_NAME, LAST_NAME, PERSON, COUNTRY, CITY, LOCATION, NATION_MAN, NATION_WOMAN, NATION_PLURAL, NATIONALITY, GPE, ORG, ORGANIZATION|
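In code, the entity dictionary is a simple lookup table. The subset below illustrates the mapping scheme; the PII label, as the union of all mapped labels, is handled separately during scoring:

```python
# Subset of the entity dictionary used to reconcile label sets
# across models (model-specific label -> shared evaluation label).
ENTITY_MAP = {
    "ORGANIZATION": "ORG",
    "COUNTRY": "GPE", "CITY": "GPE", "LOCATION": "GPE",
    "NATIONALITY": "GPE",
    "FIRST_NAME": "PERSON", "LAST_NAME": "PERSON", "PER": "PERSON",
}

def normalize(label):
    """Map a model-specific entity label onto the shared label set;
    labels already in the shared set pass through unchanged."""
    return ENTITY_MAP.get(label, label)

print(normalize("PER"), normalize("CITY"), normalize("ORG"))
```

Normalizing labels before computing the confusion matrix is what makes precision and recall comparable across libraries with 4-class and 18-class schemes.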
Results from PII
The PII results demonstrate a distinction between retrained, fine-tuned models and pretrained models. Flair 2, Flair 1, CRF and SpaCy 2 represent the highest four PII measures, and all four are fine-tuned. Flair 2, a model that uses a large set of stacked embeddings, reflects the best PII F measure overall. Notably, CRF is considerably more lightweight and had only slightly lower performance measures, suggesting that Conditional Random Fields is a performant algorithm for these NLP tasks compared to state-of-the-art stacked word embeddings.
Results from Precision and Recall for Entities
Different patterns emerge from isolating precision and recall measurements for the entities ORG, PERSON and GPE. Figure 2 demonstrates SpaCy 2 obtaining both the highest (100%) overall recall score for ORG and the lowest overall precision score (< 20%). This means that many false positives were generated; the three most frequent were ‘Texas’, ‘,’ and ‘Hobbits’. Looking at the errors, multi-token ORG entities such as ‘Trak Auto’ or ‘The Flying Bear’ were incorrectly split: ‘Trak’ was predicted as a single-token entity while ‘Auto’ was labelled as a non-entity token (‘O’ in the BILUO tagging scheme). Also noted is the inclusion of punctuation, which may indicate an issue upstream of the model evaluation, such as with the data itself or with the data-cleaning phase. The most frequent false negatives also include punctuation, such as “’S” and a hyphen “-”, supporting further investigation into issues with data cleaning, tokenization or character encoding. The measure for SpaCy 2 ORG precision decreased dramatically in the re-trained model from the pretrained model, which is counter to expectations. The absence of ORG entity support in the Presidio API, which leverages SpaCy, is also noted, though only speculation can be applied towards its relationship to the diminished precision score. Further investigation would be required to come to any conclusions.
Across all models, GPE precision appears to be the lowest-performing score, which follows a similar trend to the one noted with ORG precision. The corresponding recall measurement is consistently higher, suggesting that models over-predict GPE: most true GPE entities are captured (few false negatives), at the cost of many false positives.
Results from Recall only
Overall, ORG recall is the least performant measurement, suggesting there may be something specific about that entity that is more difficult than others to capture. Flair 1 and Flair 2 represent a stacked embedding approach and produce the highest recall scores for all entities. The CRF algorithm also generates high recall scores and took considerably less time to train, indicating that it would be useful in a data-anonymization context where privacy protection is the focus and CPU or time resources are constrained.
Ferrández et al. evaluate automated de-identification strategies for clinical documents so that the documents can be brought into compliance with the Health Insurance Portability and Accountability Act (HIPAA) and used for research purposes. Five de-identification tools are selected, and each is evaluated with an ‘out-of-the-box’ configuration. Three of the tools are rule-based systems; they use regular expressions to match patterns and dictionary lookups to detect common terms such as personal names and geographic places. The other two tools use machine learning classifiers built on Conditional Random Field (CRF) models that use a Beginning, Inside, Outside (BIO) schema to detect Personal Health Information (PHI). The authors evaluate the tools by comparing rule-based vs machine learning approaches and by looking at precision, recall and F2 scores for exact, partial and fully contained matches. The authors cite recall as the most important measurement with respect to de-identification. Given that there are 18 PHI identifiers, the differences between the rule-based and machine learning systems could sometimes be attributed to how poorly the default, out-of-the-box configurations generalized to new domains, or to how well the syntactic qualities of the entity could be expressed as a regular expression. For instance, email addresses, Social Security numbers and IP addresses are predictable patterns of characters, sometimes with a fixed length, whereas institution-specific PHI annotations often do not generalize well.
Conclusions and Future Work
Word embeddings and stacked word-embedding techniques demonstrate the best performance and also take the longest to train. Conditional Random Fields (CRF) performs well and requires relatively little time to train, which may make it more practical in situations where both time and CPU resources are limited. SpaCy demonstrates improvements with re-training and exhibits unique behavior with respect to the ORG entity, which may have more to do with data cleaning than with an insufficiency of the algorithm itself. The Presidio API has no support for the ORG entity, which may be related to the SpaCy behavior since SpaCy is declared as a dependency. A general trend across most models is a low GPE precision score.
In a privacy-preserving context, recall is the most important metric for evaluating the performance of a model, and in at least one study it is given double the weight. Though the weights of these metrics were not modified in our evaluation, the research still considers recall the most important measure, since a low recall score could lead to a privacy breach. In a situation where a model is used to de-identify text, a high false-negative count means that PII entities are missed, making the de-identification less effective.
While similar studies like that done by Ferrández et al. opted for the use of default settings, an equivalent option was not available. However, most libraries provide pre-trained NER models that serve as an equivalent default setting, or out-of-the-box solution. Flair includes both a 4-class and an 18-class NER model, SpaCy an English multi-task CNN trained on OntoNotes, and sklearn-crfsuite can be used with the CoNLL 2002 dataset to build an NER system.
Another challenge when comparing libraries or different NER models within the same library is the differences in the number of entities recognized. When comparing a 4 class NER model to an 18 class NER model there is more room for error when PII is mapped to 10 or more labels compared to 4 or less labels.
Discovering the reasons behind the general trend across the majority of models for GPE precision to be low could be one area of future research. GPE consists of a variety of potential entities, including city, address, geographical coordinates, country and expressions of nationality. What effect does this diversity have on the ability of a model to accurately identify GPE? Precision measurements for recognizing the ORG entity using SpaCy decreased after fine-tuning the pre-trained model, which deviates from expectations. Discovering whether there is a connection between that behaviour and the Presidio API’s decision not to support ORG represents a possible future research direction.
 K. Macnish, “Government Surveillance and Why Defining Privacy Matters in a Post‐Snowden World,” J. Appl. Philos., vol. 35, no. 2, pp. 417–432, May 2018, doi: 10.1111/japp.12219.
 F. Hassan, D. Sanchez, J. Soria-Comas, and J. Domingo-Ferrer, “Automatic Anonymization of Textual Documents: Detecting Sensitive Information via Word Embeddings,” in 2019 18th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/13th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), Rotorua, New Zealand, Aug. 2019, pp. 358–365, doi: 10.1109/TrustCom/BigDataSE.2019.00055.
 “Art. 4 GDPR – Definitions,” General Data Protection Regulation (GDPR). https://gdpr-info.eu/art-4-gdpr/ (accessed Mar. 15, 2020).
 B. Megyesi et al., “Learner Corpus Anonymization in the Age of GDPR: Insights from the Creation of a Learner Corpus of Swedish,” p. 10, 2018.
 A. Akbik, D. Blythe, and R. Vollgraf, “Contextual String Embeddings for Sequence Labeling,” p. 12.
 T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” ArXiv13013781 Cs, Sep. 2013, Accessed: Feb. 29, 2020. [Online]. Available: http://arxiv.org/abs/1301.3781.
 J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” p. 16.
 K. Ethayarajh, “How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings,” ArXiv190900512 Cs, Sep. 2019, Accessed: Apr. 03, 2020. [Online]. Available: http://arxiv.org/abs/1909.00512.
 A. Pika, M. T. Wynn, S. Budiono, A. H. M. ter Hofstede, W. M. P. van der Aalst, and H. A. Reijers, “Privacy-Preserving Process Mining in Healthcare,” Int. J. Environ. Res. Public. Health, vol. 17, no. 5, p. 1612, Mar. 2020, doi: 10.3390/ijerph17051612.
 J. Domingo-Ferrer and V. Torra, “A Critique of k-Anonymity and Some of Its Enhancements,” in 2008 Third International Conference on Availability, Reliability and Security, Mar. 2008, pp. 990–993, doi: 10.1109/ARES.2008.97.
 B. Mehta, U. P. Rao, R. Gupta, and M. Conti, “Towards privacy preserving unstructured big data publishing,” J. Intell. Fuzzy Syst., vol. 36, no. 4, pp. 3471–3482, Apr. 2019, doi: 10.3233/JIFS-181231.
 O. Ferrández, B. R. South, S. Shen, F. J. Friedlin, M. H. Samore, and S. M. Meystre, “Evaluating current automatic de-identification methods with Veteran’s health administration clinical documents,” BMC Med. Res. Methodol., vol. 12, no. 1, pp. 109–124, Jan. 2012, doi: 10.1186/1471-2288-12-109.
 D. Proux et al., “Natural Language Processing to Detect Risk Patterns Related to Hospital Acquired Infections,” p. 7.
 A. Cummings, T. Lewellen, D. McIntire, A. P. Moore, and R. Trzeciak, “Insider Threat Study: Illicit Cyber Activity Involving Fraud in the U.S. Financial Services Sector:,” Defense Technical Information Center, Fort Belvoir, VA, Jul. 2012. doi: 10.21236/ADA610430.
 S. Murthy, A. Abu Bakar, F. Abdul Rahim, and R. Ramli, “A Comparative Study of Data Anonymization Techniques,” in 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS), Washington, DC, USA, May 2019, pp. 306–309, doi: 10.1109/BigDataSecurity-HPSC-IDS.2019.00063.
 S. Song, N. Zhang, and H. Huang, “Named entity recognition based on conditional random fields,” Clust. Comput., vol. 22, no. S3, pp. 5195–5206, May 2019, doi: 10.1007/s10586-017-1146-3.
 G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural Architectures for Named Entity Recognition,” ArXiv160301360 Cs, Apr. 2016, Accessed: May 21, 2020. [Online]. Available: http://arxiv.org/abs/1603.01360.
 J. Serrà and A. Karatzoglou, “Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks,” ArXiv170603993 Cs, Jun. 2017, Accessed: May 23, 2020. [Online]. Available: http://arxiv.org/abs/1706.03993.
 flairNLP, “flair,” GitHub repository, 2020. [Online]. Available: https://github.com/flairNLP/flair.
 Microsoft, “presidio,” GitHub repository, 2020. [Online]. Available: https://github.com/microsoft/presidio.