Dealing with Sensitive Data

Introduction

One of the biggest challenges for a Data Science project team is protecting the data. Data is the basic raw material for building any supervised or unsupervised model; we cannot build a model without it. Let’s assume you are working for an eCommerce retail company like Amazon; or in the health domain with a big hospital chain or a government hospital; or in the banking industry with a big bank like Bank of America, Standard Chartered, SBI, or ICICI Bank. These organizations have huge amounts of data, and that data can be used to develop ML models that help them serve their customers and other stakeholders in a better, more efficient way. To develop these models, the data must be given to the Data Science team. When the data moves out of the production environment into the project environment, there is no guarantee that it will not fall into the wrong hands and be misused during the project or after its completion. When the data moves within the same company, the chances of this disaster are lower; but when the company hands data to third parties or vendors, the chances are much higher.

As a client, it is my responsibility to take measures so that the data does not fall into the hands of those for whom it is not intended. And even if it does reach unintended people, they should not be able to misuse it. How do we ensure that the data helps the Data Science team develop a high-performing model while remaining useless in the wrong hands? There are many methods for that purpose, and this article recommends four of them. Below is a summary of those four techniques, each followed by a short illustrative Python sketch.

Techniques to Handle Sensitive Data

  1. Data Removal and Encryption: Look at all the columns of all the files and ask yourself: is this information useful for model building? If you do not know the answer, ask the Data Science team. If the data is not useful, remove that file, or that column from the file. When we know a field is a must, we should encrypt it. For example, a name field may be removed, while a religion or address field may be encrypted.
  2. Data Coarsening: If you have an income field, consider rounding it off to thousand- or million-level precision, e.g. turning the amount 2378959 into 2378K or 2.3M.
  3. Data Masking: Handing over a full credit card number is dangerous, so consider giving only the last 4 digits and masking the others. You can also re-encode those 4 digits in octal, so that 8 is written as 10. Similarly, instead of giving a full address, consider providing low-precision longitude and latitude.
  4. Principal Component Analysis (PCA): PCA can be used to compress the data. For example, if your dataset has 10 columns, then after compressing it with PCA you may get 2 or 3 components that capture 99% of the variance of the original 10 columns. You can safely hand over this compressed data for modeling.
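For technique 1, here is a minimal sketch of removal plus field-level encryption, assuming a pandas DataFrame with hypothetical columns name, religion, and income, and using symmetric (Fernet) encryption from the cryptography package as one possible choice:

```python
import pandas as pd
from cryptography.fernet import Fernet

# Hypothetical raw extract; column names and values are made up.
df = pd.DataFrame({
    "name": ["Asha Rao", "John Doe"],
    "religion": ["Hindu", "Christian"],
    "income": [2378959, 815000],
})

# Removal: the name adds nothing to the model, so drop the column outright.
df = df.drop(columns=["name"])

# Encryption: religion must stay in the dataset, so encrypt it instead.
key = Fernet.generate_key()  # keep this key out of the data you hand over
cipher = Fernet(key)
df["religion"] = df["religion"].apply(
    lambda v: cipher.encrypt(v.encode()).decode()
)

print(df)
```

Only the holder of the key can decrypt the field, so the shared file tells an interceptor nothing about the encrypted values.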
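For technique 2, a minimal coarsening sketch, assuming the incomes live in a pandas Series; the truncation below reproduces the 2378959 → 2378K / 2.3M example above:

```python
import pandas as pd

income = pd.Series([2378959, 815000, 12499])

# Thousand-level precision: 2378959 -> "2378K"
coarse_k = (income // 1_000).astype(str) + "K"

# Million-level precision with one decimal: 2378959 -> "2.3M"
coarse_m = ((income // 100_000) / 10).astype(str) + "M"

print(pd.DataFrame({"thousands": coarse_k, "millions": coarse_m}))
```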
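For technique 3, a minimal masking sketch; the card number and coordinates below are invented, and the octal step simply re-encodes the visible digits as described above:

```python
def mask_card(number: str) -> str:
    """Keep only the last 4 digits: '4111111111111118' -> '************1118'."""
    return "*" * (len(number) - 4) + number[-4:]

def last4_as_octal(number: str) -> str:
    """Re-encode the last 4 digits in octal, so decimal 8 becomes 10."""
    return oct(int(number[-4:]))[2:]

def coarsen_location(lat: float, lon: float, places: int = 1) -> tuple:
    """Low-precision coordinates instead of a street address."""
    return (round(lat, places), round(lon, places))

print(mask_card("4111111111111118"))       # ************1118
print(last4_as_octal("4111111111111118"))  # 2136 (octal of 1118)
print(coarsen_location(28.6139, 77.2090))  # (28.6, 77.2)
```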
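And for technique 4, a minimal PCA-compression sketch with scikit-learn; the data here is synthetic, generated from 3 hidden factors so that a few components really do capture about 99% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for real data: 10 columns driven by 3 latent factors.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# n_components=0.99 tells scikit-learn to keep just enough components
# to explain 99% of the variance.
pca = PCA(n_components=0.99)
X_compressed = pca.fit_transform(X)

print(X_compressed.shape)                        # (500, 3) for this data
print(round(pca.explained_variance_ratio_.sum(), 4))
```

Hand over only X_compressed; keep the fitted pca object to yourself, since its inverse_transform can approximately reconstruct the original columns.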

If you are using any other techniques, feel free to share them in the comments.
