Data Science👨‍💻: Data Preprocessing with Orange Tool

Manthan Bhikadiya 💡
4 min read · Sep 17, 2021


Welcome to the Data Science Blog Series. Do check out my previous blog from the data science blog series here.

Don’t Stop When You’re Tired, Stop When You’re Done.

~ Wesley Snipes

Overview:

This blog is the third part of the Orange tool series. In it, I discuss how you can use the Orange library in Python to perform data preprocessing tasks such as discretization, continuization, randomization, and normalization with the help of various Orange functions.

In the Orange tool, add the Python Script widget from the left panel to the canvas and double-click it to open the script editor.

Python Script Widget

All the scripts are available on GitHub.

Do Check out the link at the end of this blog.

Discretization:

Data discretization converts a large number of data values into a smaller number of groups so that the data becomes easier to evaluate and manage. In other words, it maps the values of a continuous attribute to a finite set of intervals with minimal loss of information. In this example, I use iris, a built-in dataset provided by Orange that classifies flowers based on their characteristics. The Discretize function is used to perform discretization.

Discretization using Python Script

Continuization:

Given a data table, continuization returns a new table in which the discrete attributes are replaced with continuous ones or removed:

  • binary variables are transformed into 0.0/1.0 or -1.0/1.0 indicator variables, depending upon the argument zero_based.
  • multinomial variables are treated according to the argument multinomial_treatment.
  • discrete attributes with only one possible value are removed.

Continuize.Indicators

The variable is replaced by indicator variables, each corresponding to one value of the original variable. For each value of the original attribute, only the corresponding new attribute will have a value of one, and the others will be zero. This is the default behavior.

For example, as shown in the code snippet below, the “titanic” dataset has a feature “status” with the values “crew”, “first”, “second” and “third”, in that order. Its value for the 10th row is “first”. Continuization replaces the variable with the variables “status=crew”, “status=first”, “status=second” and “status=third”; for that row, “status=first” is 1 and the other three are 0.

Continuization using Python Script

Normalization:

Normalization scales the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally required when we are dealing with attributes on different scales; otherwise, an equally important attribute on a smaller scale may lose effectiveness because other attributes have values on a larger scale. We use the Normalize function to perform normalization.

Normalization using Python Script

Randomization:

With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. The Randomize function from the Orange library is used to perform randomization.

Randomization using Python Script

Python Script Files:

Conclusion:

I hope you learned something…

Do check out more features of the Orange tool here.

Previous blogs about the Orange tool: Blog1 & Blog2.

Keep Exploring…!!👍

LinkedIn:

Github:

Thanks for reading! If you enjoyed this article, please hit the clap 👏 button as many times as you can (max 50 times 😂). It would mean a lot and encourage me to keep sharing my knowledge. If you like my content, follow me on Medium; I will try to post as many blogs as I can.
