{"id":16955,"date":"2024-01-04T21:10:25","date_gmt":"2024-01-05T02:10:25","guid":{"rendered":"https:\/\/www.iri.com\/blog\/?p=16955"},"modified":"2024-01-05T07:39:16","modified_gmt":"2024-01-05T12:39:16","slug":"named-entity-recognition-ner-in-iri-darkshield","status":"publish","type":"post","link":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/","title":{"rendered":"Training NER Models in IRI DarkShield"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Natural Language Processing (NLP) task wizards in the <\/span><a href=\"https:\/\/www.iri.com\/products\/workbench\/darkshield-gui\"><span style=\"font-weight: 400;\">IRI Workbench GUI for DarkShield<\/span><\/a><span style=\"font-weight: 400;\"> are designed to help you improve the accuracy of finding PII in unstructured sources using content-aware search matchers, called <a href=\"https:\/\/www.iri.com\/blog\/data-protection\/data-matchers\/\">Data Matchers<\/a>, during DarkShield data discovery operations. The results from NLP and other supported search matchers are serialized in PII location annotation and log files which are used for data classification, masking, and reporting.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To make these NLP jobs more efficient, DarkShield can use \u2013 and help you train using Machine Learning \u2013 Apache OpenNLP, PyTorch, or TensorFlow models for Named Entity Recognition (NER). 
NER models make it easier to find proper nouns (like people\u2019s names) within text files and documents because they recognize terms based on English (or other language) sentence grammar.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This article documents the currently supported NLP task wizards in DarkShield, which include:<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Transformer Training Data (<\/span><a href=\"https:\/\/noisy-text.github.io\/2022\/index.html#\"><span style=\"font-weight: 400;\">WNUT format<\/span><\/a><span style=\"font-weight: 400;\">) Creation<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">NER Transformer Fine-tuning<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">OpenNLP Model Supervised Training<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">From the DarkShield menu in the top toolbar of IRI Workbench, you can select from this list:<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16960\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/nlp-task-wizard2-300x256.png\" alt=\"\" width=\"420\" height=\"358\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/nlp-task-wizard2-300x256.png 300w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/nlp-task-wizard2.png 508w\" sizes=\"(max-width: 420px) 100vw, 420px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">After making your choice in the drop-down box on this top-level page, click <\/span><i><span style=\"font-weight: 400;\">Next<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h5><b>Preprocessing Training Data (WNUT format)<\/b><\/h5>\n<p><span style=\"font-weight: 400;\">This wizard assists in the creation of training data that can be used in the fine-tuning of Named 
Entity Recognition models. These NER models are referred to as <\/span><i><span style=\"font-weight: 400;\">transformers.<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">Transformers require datasets referred to as <\/span><i><span style=\"font-weight: 400;\">training data <\/span><\/i><span style=\"font-weight: 400;\">during the process of fine-tuning<\/span><i><span style=\"font-weight: 400;\">. <\/span><\/i><span style=\"font-weight: 400;\">The task selection <\/span><i><span style=\"font-weight: 400;\">Create Transformer Training Data<\/span><\/i><span style=\"font-weight: 400;\"> helps you produce training data, in the correct format, which is required for the fine-tuning.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Training data is produced through a multi-step process:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Raw text from a text file is split into sentences using various sentence segmentation techniques.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Each sentence is then processed and tokenized so that each word in the sentence is assigned a label (e.g. 
Person, Location, Organization\u2026)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Annotated sentences are translated into Workshop of Noisy User-generated Text (<\/span><a href=\"https:\/\/noisy-text.github.io\/2022\/index.html#\"><span style=\"font-weight: 400;\">WNUT<\/span><\/a><span style=\"font-weight: 400;\">) format.<\/span><\/li>\n<\/ol>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16977\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/setup-for-training-data-prep-300x276.png\" alt=\"\" width=\"454\" height=\"418\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/setup-for-training-data-prep-300x276.png 300w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/setup-for-training-data-prep.png 510w\" sizes=\"(max-width: 454px) 100vw, 454px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">The first page is the setup page for the NLP task. You must indicate the folder that the batch script and resulting training data will be placed inside. Next, choose a NER model that will be used in the preprocessing task.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The loaded pretrained model will be used to annotate the raw text. Therefore, the model used to annotate the training data and the model that will be fine-tuned should either be the same, or possess the same label identifiers.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">After choosing a model, indicate the model type. Currently, supported model-type frameworks are PyTorch and TensorFlow. 
Then click <\/span><i><span style=\"font-weight: 400;\">Next<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16962\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/raw-training-data-300x275.png\" alt=\"\" width=\"415\" height=\"380\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/raw-training-data-300x275.png 300w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/raw-training-data.png 509w\" sizes=\"(max-width: 415px) 100vw, 415px\" \/><\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">On the next page, you will provide text documents that are in raw, free-floating text format. <\/span><span style=\"font-weight: 400;\">These documents will be compiled into one large dataset and split apart by sentence.\u00a0<\/span><\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">To split by sentence, an NLP task called sentence boundary detection must be performed. Two methods are available to achieve this: split on punctuations, or natural language processing to detect boundaries of sentences.<\/span><\/p>\n<h5 style=\"text-align: left;\"><strong>Split on Punctuations<\/strong><\/h5>\n<p><span style=\"font-weight: 400;\">As the name implies, this option splits free-floating text into separate sentences by relying on punctuations (.?!) as delimiters. It is fast and simple to apply but can lead to false positives.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For example the sentence \u201cDr. Wood met Mr. 
Smith at the clinic today.\u201d would be split into three parts: \u201cDr\u201d, \u201cWood met Mr\u201d, and \u201cSmith at the clinic today\u201d.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Relying on punctuation as a sentence delimiter is also unreliable because punctuation conventions vary across written languages.<\/span><\/p>\n<h5><strong>NLP Segmentation<\/strong><\/h5>\n<p><span style=\"font-weight: 400;\">This option uses an open-source library called <\/span><a href=\"https:\/\/github.com\/nipunsadvilkar\/pySBD\"><i><span style=\"font-weight: 400;\">pySBD<\/span><\/i><\/a><span style=\"font-weight: 400;\">, short for Python Sentence Boundary Disambiguation, to apply rule-based segmentation to text. This method provides advantages over splitting on punctuation because it supports 22 different languages.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16963\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/nlp-segmentation-WNUT-300x295.png\" alt=\"\" width=\"403\" height=\"396\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/nlp-segmentation-WNUT-300x295.png 300w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/nlp-segmentation-WNUT-70x70.png 70w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/nlp-segmentation-WNUT.png 509w\" sizes=\"(max-width: 403px) 100vw, 403px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">After selecting the desired sentence boundary detection method, click <\/span><i><span style=\"font-weight: 400;\">Add\u2026 <\/span><\/i><span style=\"font-weight: 400;\">to browse for and add plain text documents. 
Use the <\/span><i><span style=\"font-weight: 400;\">Edit\u2026<\/span><\/i><span style=\"font-weight: 400;\"> button to adjust your selection and <\/span><i><span style=\"font-weight: 400;\">Remove\u2026<\/span><\/i><span style=\"font-weight: 400;\"> to remove documents.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Once done, click <\/span><i><span style=\"font-weight: 400;\">Finish<\/span><\/i><span style=\"font-weight: 400;\"> to produce a script that will execute the production of the training data.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16964\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/run-as-batch-program-300x242.png\" alt=\"\" width=\"555\" height=\"448\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/run-as-batch-program-300x242.png 300w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/run-as-batch-program.png 697w\" sizes=\"(max-width: 555px) 100vw, 555px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Right-click on the produced batch script and select <\/span><i><span style=\"font-weight: 400;\">Run As &gt; Batch Program<\/span><\/i><span style=\"font-weight: 400;\">. 
The script will run and produce a dataset for training purposes in <\/span><a href=\"https:\/\/noisy-text.github.io\/2022\/index.html#\"><span style=\"font-weight: 400;\">WNUT<\/span><\/a><span style=\"font-weight: 400;\"> format.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-16966\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/wnut-dataset2-300x297.png\" alt=\"\" width=\"300\" height=\"297\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/wnut-dataset2-300x297.png 300w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/wnut-dataset2-150x150.png 150w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/wnut-dataset2-70x70.png 70w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/wnut-dataset2.png 371w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Note that the process of creating the training dataset can take several minutes to hours depending on the size of the training data provided.<\/span><\/p>\n<h5><b>Transformer Model Trainer<\/b><\/h5>\n<p><span style=\"font-weight: 400;\">This wizard assists in the fine-tuning of PyTorch and TensorFlow NER models to improve the accuracy of PII search results. As mentioned, <\/span><span style=\"font-weight: 400;\">PyTorch and TensorFlow are machine learning frameworks.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For better accuracy and performance, it is expected that these pretrained models will be trained on datasets specific to the NER tasks you need to run. In other words, pretrained models should be fine-tuned with data similar to what will be searched in normal operations.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Hugging Face is an American company noted for its transformers library built for natural language processing (NLP) applications. 
Its platform allows users to share machine learning models and datasets.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Hugging Face transformers give DarkShield the ability to use AI-powered search matchers to discover and categorize sensitive information like names, places, and organizations. The Hugging Face <\/span><a href=\"https:\/\/huggingface.co\/models?pipeline_tag=token-classification&amp;sort=downloads\"><span style=\"font-weight: 400;\">Model Hub<\/span><\/a><span style=\"font-weight: 400;\"> provides access to thousands of pretrained models for a variety of use cases and languages.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16968\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/setup-transformer-fine-tuning-300x260.png\" alt=\"\" width=\"423\" height=\"366\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/setup-transformer-fine-tuning-300x260.png 300w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/setup-transformer-fine-tuning.png 508w\" sizes=\"(max-width: 423px) 100vw, 423px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">The steps on the first page of the <\/span><i><span style=\"font-weight: 400;\">Transformer Model Trainer<\/span><\/i><span style=\"font-weight: 400;\"> wizard are the same as the steps on the first page of the <\/span><i><span style=\"font-weight: 400;\">Preprocess Training Data (WNUT)<\/span><\/i><span style=\"font-weight: 400;\"> wizard<\/span><i><span style=\"font-weight: 400;\">.<\/span><\/i><span style=\"font-weight: 400;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Once finished, click <\/span><i><span style=\"font-weight: 400;\">Next<\/span><\/i><span style=\"font-weight: 400;\"> to move to the next page of the wizard, where fine-tuning of the dataset begins.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16969\" 
src=\"\/blog\/wp-content\/uploads\/2024\/01\/fine-tuning-training-data-217x300.png\" alt=\"\" width=\"485\" height=\"671\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/fine-tuning-training-data-217x300.png 217w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/fine-tuning-training-data.png 507w\" sizes=\"(max-width: 485px) 100vw, 485px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">On this page, the fine-tuning of training data is a multi-step process.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Steps:<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Review the words that have been organized and assigned labels in each individual folder tab at the bottom. Each folder tab represents a type of entity and provides both a scroll bar and a search box. By default, the search box filters the list to words that start with the characters you enter. If .* precedes the characters, it filters to words that end with them instead (for example, .*rot matches carrot, parrot, and rot).\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">If any word needs to be modified, select the word and click <\/span><i><span style=\"font-weight: 400;\">Edit<\/span><\/i><span style=\"font-weight: 400;\"> to open a dialog where you can change the original text value.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Any misassignment of words to labels can be undone by selecting a word and then clicking <\/span><i><span style=\"font-weight: 400;\">Unlabel. 
<\/span><\/i><span style=\"font-weight: 400;\">This will un-assign the word and place it in a pool of words under <\/span><i><span style=\"font-weight: 400;\">Unlabeled words<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Words placed in the <\/span><i><span style=\"font-weight: 400;\">Unlabeled words<\/span><\/i><span style=\"font-weight: 400;\"> table can be reassigned to different labels by selecting the desired folder tab associated with a label, selecting the desired word in the <\/span><i><span style=\"font-weight: 400;\">Unlabeled words<\/span><\/i><span style=\"font-weight: 400;\"> table, and clicking <\/span><i><span style=\"font-weight: 400;\">Label <\/span><\/i><span style=\"font-weight: 400;\">to reassign the word to a different label.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Once reassignments are finished, click <\/span><i><span style=\"font-weight: 400;\">Next 1000 Sentences<\/span><\/i><span style=\"font-weight: 400;\">\u2026 to process the next chunk of sentences, or click <\/span><i><span style=\"font-weight: 400;\">Complete<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Repeat steps 1-5 as many times as necessary until you decide to click <\/span><i><span style=\"font-weight: 400;\">Complete.<\/span><\/i><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Once fine-tuning of data is complete, click the <\/span><i><span style=\"font-weight: 400;\">Finish<\/span><\/i><span style=\"font-weight: 400;\"> button at the bottom.<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">After the wizard page is completed a file called finetuned_dataset.txt will be produced along with a batch script containing instructions to train a model with the newly produced 
dataset. Right-click on the produced batch script and select <\/span><i><span style=\"font-weight: 400;\">Run As &gt; Batch Program<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16970\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/batch-program-300x269.png\" alt=\"\" width=\"514\" height=\"461\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/batch-program-300x269.png 300w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/batch-program.png 598w\" sizes=\"(max-width: 514px) 100vw, 514px\" \/><\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">The batch script will instruct a model to start the training process using the fine-tuned dataset. This process can take several minutes to hours based on the size of the finetuned_dataset.txt file. You can track the progress of the training from the console log; for example:<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16971\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/console-300x183.png\" alt=\"\" width=\"659\" height=\"402\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/console-300x183.png 300w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/console-768x468.png 768w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/console.png 784w\" sizes=\"(max-width: 659px) 100vw, 659px\" \/><\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">When training is complete, the console log will report the training has successfully finished and a new model that has been trained on the data will be produced.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16972\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/Console-log-reports-training-300x52.png\" alt=\"\" width=\"985\" 
height=\"171\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/Console-log-reports-training-300x52.png 300w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/Console-log-reports-training-768x132.png 768w\" sizes=\"(max-width: 985px) 100vw, 985px\" \/><\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\">Console log reports training has successfully finished<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">Once the new model is created, place the model.bin along with the config.json file that was produced inside a folder. At this point, the newly trained model can now be assigned to data classes and used by DarkShield for NER matching processes.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-16973\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/produced-file.png\" alt=\"\" width=\"265\" height=\"127\" \/><\/p>\n<p style=\"text-align: center;\"><i><\/i><i><span style=\"font-weight: 400;\">After training is completed these files are produced.<\/span><\/i><\/p>\n<h5><b>NER Results Preview<\/b><\/h5>\n<p><span style=\"font-weight: 400;\">In the Workbench there is a view called <\/span><i><span style=\"font-weight: 400;\">NER Results Preview. <\/span><\/i><span style=\"font-weight: 400;\">\u00a0This panel allows you to test the accuracy and performance of NER models on free-floating text. 
The text will be categorized based on the context of the sentence and words will receive labels and a probability score.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-16974\" src=\"\/blog\/wp-content\/uploads\/2024\/01\/NER-Results-Preview-Workbench-View-300x208.png\" alt=\"\" width=\"567\" height=\"393\" srcset=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Results-Preview-Workbench-View-300x208.png 300w, https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Results-Preview-Workbench-View.png 607w\" sizes=\"(max-width: 567px) 100vw, 567px\" \/><\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\">NER Results Preview Workbench View<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">If you have any questions or need help using these wizards to improve your search results, please email <\/span><a href=\"mailto:darkshield@iri.com\"><span style=\"font-weight: 400;\">darkshield@iri.com<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Natural Language Processing (NLP) task wizards in the IRI Workbench GUI for DarkShield are designed to help you improve the accuracy of finding PII in unstructured sources using content-aware search matchers, called Data Matchers, during DarkShield data discovery operations. 
The results from NLP and other supported search matchers are serialized in PII location annotation and<\/p>\n<div><a class=\"btn-filled btn\" href=\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/\" title=\"Training NER Models in IRI DarkShield\">Read More<\/a><\/div>\n","protected":false},"author":152,"featured_media":16956,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[8],"tags":[1386,221,1656,1655,1728],"class_list":["post-16955","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-protection","tag-darkshield","tag-iri-workbench-gui","tag-named-entity-recognition","tag-ner","tag-nlp-task-wizards"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Training NER Models in IRI DarkShield - IRI<\/title>\n<meta name=\"description\" content=\"Natural Language Processing (NLP) task wizards in the IRI Workbench GUI for DarkShield are designed to help you improve the accuracy of finding PII in unstructured sources during DarkShield data discovery operations. 
The results from NLP and other supported search matchers are serialized in location annotation and log files which are used for data classification, masking, and reporting.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Training NER Models in IRI DarkShield - IRI\" \/>\n<meta property=\"og:description\" content=\"Natural Language Processing (NLP) task wizards in the IRI Workbench GUI for DarkShield are designed to help you improve the accuracy of finding PII in unstructured sources during DarkShield data discovery operations. The results from NLP and other supported search matchers are serialized in location annotation and log files which are used for data classification, masking, and reporting.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/\" \/>\n<meta property=\"og:site_name\" content=\"IRI\" \/>\n<meta property=\"article:published_time\" content=\"2024-01-05T02:10:25+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-01-05T12:39:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Darkshield-featured-image.png\" \/>\n\t<meta property=\"og:image:width\" content=\"768\" \/>\n\t<meta property=\"og:image:height\" content=\"368\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Adam Lewis\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Adam Lewis\" \/>\n\t<meta 
name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/\"},\"author\":{\"name\":\"Adam Lewis\",\"@id\":\"https:\/\/beta.iri.com\/blog\/#\/schema\/person\/37c0e5beab094bd61cc521902df2876e\"},\"headline\":\"Training NER Models in IRI DarkShield\",\"datePublished\":\"2024-01-05T02:10:25+00:00\",\"dateModified\":\"2024-01-05T12:39:16+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/\"},\"wordCount\":1381,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/beta.iri.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Darkshield-featured-image.png\",\"keywords\":[\"DarkShield\",\"IRI Workbench GUI\",\"Named entity recognition\",\"NER\",\"NLP task wizards\"],\"articleSection\":[\"Data Masking\/Protection\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/\",\"url\":\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/\",\"name\":\"Training NER Models in IRI DarkShield - 
IRI\",\"isPartOf\":{\"@id\":\"https:\/\/beta.iri.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Darkshield-featured-image.png\",\"datePublished\":\"2024-01-05T02:10:25+00:00\",\"dateModified\":\"2024-01-05T12:39:16+00:00\",\"description\":\"Natural Language Processing (NLP) task wizards in the IRI Workbench GUI for DarkShield are designed to help you improve the accuracy of finding PII in unstructured sources during DarkShield data discovery operations. The results from NLP and other supported search matchers are serialized in location annotation and log files which are used for data classification, masking, and reporting.\",\"breadcrumb\":{\"@id\":\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#primaryimage\",\"url\":\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Darkshield-featured-image.png\",\"contentUrl\":\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Darkshield-featured-image.png\",\"width\":768,\"height\":368},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/beta.iri.com\/blog\/\"},{\"@type
\":\"ListItem\",\"position\":2,\"name\":\"Training NER Models in IRI DarkShield\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/beta.iri.com\/blog\/#website\",\"url\":\"https:\/\/beta.iri.com\/blog\/\",\"name\":\"IRI\",\"description\":\"Total Data Management Blog\",\"publisher\":{\"@id\":\"https:\/\/beta.iri.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/beta.iri.com\/blog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/beta.iri.com\/blog\/#organization\",\"name\":\"IRI\",\"url\":\"https:\/\/beta.iri.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/beta.iri.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png\",\"contentUrl\":\"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png\",\"width\":750,\"height\":206,\"caption\":\"IRI\"},\"image\":{\"@id\":\"https:\/\/beta.iri.com\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/beta.iri.com\/blog\/#\/schema\/person\/37c0e5beab094bd61cc521902df2876e\",\"name\":\"Adam Lewis\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/beta.iri.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/087667d0c75d33bb6fab6e734bd89333?s=96&d=blank&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/087667d0c75d33bb6fab6e734bd89333?s=96&d=blank&r=g\",\"caption\":\"Adam Lewis\"},\"url\":\"https:\/\/beta.iri.com\/blog\/author\/adaml\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Training NER Models in IRI DarkShield - IRI","description":"Natural Language Processing (NLP) task wizards in the IRI Workbench GUI for DarkShield are designed to help you improve the accuracy of finding PII in unstructured sources during DarkShield data discovery operations. The results from NLP and other supported search matchers are serialized in location annotation and log files which are used for data classification, masking, and reporting.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/","og_locale":"en_US","og_type":"article","og_title":"Training NER Models in IRI DarkShield - IRI","og_description":"Natural Language Processing (NLP) task wizards in the IRI Workbench GUI for DarkShield are designed to help you improve the accuracy of finding PII in unstructured sources during DarkShield data discovery operations. The results from NLP and other supported search matchers are serialized in location annotation and log files which are used for data classification, masking, and reporting.","og_url":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/","og_site_name":"IRI","article_published_time":"2024-01-05T02:10:25+00:00","article_modified_time":"2024-01-05T12:39:16+00:00","og_image":[{"width":768,"height":368,"url":"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Darkshield-featured-image.png","type":"image\/png"}],"author":"Adam Lewis","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Adam Lewis","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#article","isPartOf":{"@id":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/"},"author":{"name":"Adam Lewis","@id":"https:\/\/beta.iri.com\/blog\/#\/schema\/person\/37c0e5beab094bd61cc521902df2876e"},"headline":"Training NER Models in IRI DarkShield","datePublished":"2024-01-05T02:10:25+00:00","dateModified":"2024-01-05T12:39:16+00:00","mainEntityOfPage":{"@id":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/"},"wordCount":1381,"commentCount":0,"publisher":{"@id":"https:\/\/beta.iri.com\/blog\/#organization"},"image":{"@id":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#primaryimage"},"thumbnailUrl":"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Darkshield-featured-image.png","keywords":["DarkShield","IRI Workbench GUI","Named entity recognition","NER","NLP task wizards"],"articleSection":["Data Masking\/Protection"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/","url":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/","name":"Training NER Models in IRI DarkShield - 
IRI","isPartOf":{"@id":"https:\/\/beta.iri.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#primaryimage"},"image":{"@id":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#primaryimage"},"thumbnailUrl":"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Darkshield-featured-image.png","datePublished":"2024-01-05T02:10:25+00:00","dateModified":"2024-01-05T12:39:16+00:00","description":"Natural Language Processing (NLP) task wizards in the IRI Workbench GUI for DarkShield are designed to help you improve the accuracy of finding PII in unstructured sources during DarkShield data discovery operations. The results from NLP and other supported search matchers are serialized in location annotation and log files which are used for data classification, masking, and reporting.","breadcrumb":{"@id":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#primaryimage","url":"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Darkshield-featured-image.png","contentUrl":"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Darkshield-featured-image.png","width":768,"height":368},{"@type":"BreadcrumbList","@id":"https:\/\/beta.iri.com\/blog\/data-protection\/named-entity-recognition-ner-in-iri-darkshield\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/beta.iri.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Training NER Models in IRI 
DarkShield"}]},{"@type":"WebSite","@id":"https:\/\/beta.iri.com\/blog\/#website","url":"https:\/\/beta.iri.com\/blog\/","name":"IRI","description":"Total Data Management Blog","publisher":{"@id":"https:\/\/beta.iri.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/beta.iri.com\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/beta.iri.com\/blog\/#organization","name":"IRI","url":"https:\/\/beta.iri.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/beta.iri.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png","contentUrl":"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2019\/02\/iri-logo-total-data-management-small-1.png","width":750,"height":206,"caption":"IRI"},"image":{"@id":"https:\/\/beta.iri.com\/blog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/beta.iri.com\/blog\/#\/schema\/person\/37c0e5beab094bd61cc521902df2876e","name":"Adam Lewis","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/beta.iri.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/087667d0c75d33bb6fab6e734bd89333?s=96&d=blank&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/087667d0c75d33bb6fab6e734bd89333?s=96&d=blank&r=g","caption":"Adam 
Lewis"},"url":"https:\/\/beta.iri.com\/blog\/author\/adaml\/"}]}},"jetpack_featured_media_url":"https:\/\/beta.iri.com\/blog\/wp-content\/uploads\/2024\/01\/NER-Darkshield-featured-image.png","_links":{"self":[{"href":"https:\/\/beta.iri.com\/blog\/wp-json\/wp\/v2\/posts\/16955"}],"collection":[{"href":"https:\/\/beta.iri.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/beta.iri.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/beta.iri.com\/blog\/wp-json\/wp\/v2\/users\/152"}],"replies":[{"embeddable":true,"href":"https:\/\/beta.iri.com\/blog\/wp-json\/wp\/v2\/comments?post=16955"}],"version-history":[{"count":11,"href":"https:\/\/beta.iri.com\/blog\/wp-json\/wp\/v2\/posts\/16955\/revisions"}],"predecessor-version":[{"id":17793,"href":"https:\/\/beta.iri.com\/blog\/wp-json\/wp\/v2\/posts\/16955\/revisions\/17793"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/beta.iri.com\/blog\/wp-json\/wp\/v2\/media\/16956"}],"wp:attachment":[{"href":"https:\/\/beta.iri.com\/blog\/wp-json\/wp\/v2\/media?parent=16955"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/beta.iri.com\/blog\/wp-json\/wp\/v2\/categories?post=16955"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/beta.iri.com\/blog\/wp-json\/wp\/v2\/tags?post=16955"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}