Data Science👨‍💻: Data Reduction Techniques Using Python

Manthan Bhikadiya 💡 · Published in Geek Culture · 4 min read · Oct 25, 2021

Welcome to the Data Science Blog Series. Do check out my previous blog in the series here.

"Success is not final, failure is not fatal: it is the courage to continue that counts."

~ Winston S. Churchill

Data Reduction:

Data mining is used to handle huge amounts of data, and analysis becomes harder as the volume of data grows. Data reduction techniques address this problem: they aim to increase storage efficiency and reduce data storage and analysis costs while preserving the essential information in the data.

Dimensionality Reduction:

This reduces the size of data through encoding mechanisms. The reduction can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).

Principal Component Analysis:

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of distinct principal components equals the smaller of the number of original variables and the number of observations minus one. The transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors form an uncorrelated orthogonal basis set.

PCA is sensitive to the relative scaling of the original variables.

About the Dataset:
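The original post embedded its code and output as images. The feature names mentioned below identify the dataset as the classic Iris dataset (150 samples, 4 numeric features, 3 classes). A minimal loading sketch, assuming scikit-learn and pandas are available:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset: 150 samples, 4 numeric features, 3 classes
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

print(df.shape)   # (150, 5)
print(df.head())
```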

Principal Component Analysis:
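Since PCA is sensitive to scaling, a common first step is to standardize the features before fitting. A sketch with scikit-learn, reusing the `iris` object loaded above (the exact preprocessing in the original code is an assumption):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is sensitive to scale, so standardize the features first
X = StandardScaler().fit_transform(iris.data)

# Fit PCA, keeping the top 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)  # (150, 2)
```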

Component Projection (2D):
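A sketch of the 2D projection plot, assuming matplotlib and the `X_pca` array from the snippet above:

```python
import matplotlib.pyplot as plt

# Scatter plot of the data projected onto the first two principal components
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=name)

plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()
plt.title("Iris projected onto two principal components")
plt.show()
```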

The explained variance tells you how much information (variance) can be attributed to each of the principal components. This is important as while you can convert 4-dimensional space to 2-dimensional space, you lose some of the variance (information) when you do this. By using the attribute explained_variance_ratio_, you can see that the first principal component contains 72.77% of the variance and the second principal component contains 23.03% of the variance. Together, the two components contain 95.80% of the information.
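These figures can be read directly off the fitted PCA object; the exact values depend on preprocessing, so treat the numbers in the comment as approximate:

```python
# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
# e.g. [0.7277 0.2303] -> together the two components retain ~95.8%
```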

Component Projection (3D):

The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original 4-dimensional data into 3 dimensions. The new components are just the three main dimensions of variation.
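A sketch of the 3D projection, under the same assumptions as the snippets above:

```python
# Project the 4-dimensional data onto 3 principal components
pca3 = PCA(n_components=3)
X_pca3 = pca3.fit_transform(X)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    ax.scatter(X_pca3[mask, 0], X_pca3[mask, 1], X_pca3[mask, 2], label=name)

ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
ax.legend()
plt.show()
```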

Variance Threshold:


Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so our data isn't affected here.
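A minimal sketch with scikit-learn's VarianceThreshold, using the default threshold of 0.0:

```python
from sklearn.feature_selection import VarianceThreshold

# Remove features whose variance is below the threshold (default 0.0)
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(iris.data)

# Iris has no zero-variance features, so all 4 columns survive
print(X_selected.shape)     # (150, 4)
print(selector.variances_)  # per-feature variances
```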

(Reference: https://chrisalbon.com/code/machine_learning/feature_selection)

t-SNE:

t-Distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data. In simpler terms, t-SNE gives you a feel or intuition for how the data is arranged in a high-dimensional space. It was developed by Laurens van der Maaten and Geoffrey Hinton in 2008.

More About t-SNE…

https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1

Code:
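The original code embed did not survive extraction; the sketch below reconstructs the t-SNE step under the same assumptions as the earlier snippets (standardized Iris features in `X`), with illustrative parameter choices:

```python
from sklearn.manifold import TSNE

# Embed the standardized 4-dimensional data into 2 dimensions with t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)

for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1], label=name)

plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.legend()
plt.title("Iris embedded with t-SNE")
plt.show()
```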

Conclusion:

I hope you now have an understanding of data reduction techniques such as PCA, VarianceThreshold, and t-SNE.

More About Data Reduction.

LinkedIn:

Github:

Thanks for reading! If you enjoyed this article, please hit the clap 👏 button as many times as you can. It would mean a lot and encourage me to keep sharing my knowledge. If you like my content, follow me on Medium; I will try to post as many blogs as I can.
