How to One Hot Encode Sequence Data in Python

In this tutorial, we will learn to convert input or output sequence data to a one-hot encoding for use in sequence classification. One-hot encoding is a useful technique in machine learning because many machine learning algorithms cannot work with categorical data directly. While working with datasets, we often come across columns whose values have no specific order of preference. If we are working on a sequence classification problem, the categorical data must be converted to numbers. This technique is also used when we work with deep learning methods such as Long Short-Term Memory (LSTM) recurrent neural networks.

First, we will discuss categorical data.

What is Categorical Data?

Categorical data are variables that hold label values rather than numerical values. Such variables are also called nominal. Let's see the following example of categorical data.
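As a small illustration, categorical variables can be written down in Python as plain lists of labels. Only the "place" variable name comes from this tutorial; the "pet" and "color" variables and all of the values below are assumptions chosen for illustration.

```python
# Hypothetical categorical variables; names and values are illustrative only.
categorical_examples = {
    "pet": ["dog", "cat"],                  # nominal: no natural order
    "color": ["red", "green", "blue"],      # nominal: no natural order
    "place": ["first", "second", "third"],  # ordinal: has a natural order
}

# Print each variable name with its possible label values.
for name, values in categorical_examples.items():
    print(f"{name}: {values}")
```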
As we can see in the above examples, some categories may have a natural relationship to each other, such as a natural ordering. In the third example, the "place" variable has a natural ordering of values.

Problem with Categorical Data

Some machine learning algorithms can work with categorical data directly. Many others cannot operate on label data directly because they require all input and output variables to be numeric.

Therefore, we have to convert categorical data into numerical form. If the categorical variable is an output variable, we may also want to convert the model's predictions back into categorical form in order to present them or use them in some application.

How to Convert Categorical Data to Numeric Data

There are two methods used to convert categorical data into numerical data: integer encoding and one-hot encoding.
In the next section, we will discuss one-hot encoding.

What is One Hot Encoding?

One-hot encoding is used to convert categorical variables into numeric values. First, the categorical values are mapped to integer values. Then each integer value is represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1. In tabular form, each category gets its own column, and each column contains "0" or "1" depending on whether the row belongs to that category.

Example of a One Hot Encoding

Let's understand it using a simple example. Suppose we have a sequence of labels with the values 'yellow' and 'red'. To convert them into numerical values, we assign 'yellow' the integer value 1 and 'red' the integer value 0. Whenever we encounter these labels, we assign the same integer values. This is called integer encoding.

Let's see another example. Suppose there is a category called animal that has four values: Cat, Dog, Cow and Camel. Consider the following table, which consists of the animals and their corresponding categorical values.

Input Table -
The output after one-hot encoding is shown below.
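A minimal sketch of how such a one-hot table could be built in plain Python, assuming the four animal values from the example in the order Cat, Dog, Cow, Camel. The function name `one_hot_table` is a hypothetical helper introduced here, not part of any library.

```python
def one_hot_table(categories):
    """Return a dict mapping each category to its one-hot vector.

    Hypothetical helper: the position of each category in the input
    list decides which element of its vector is set to 1.
    """
    size = len(categories)
    table = {}
    for index, category in enumerate(categories):
        vector = [0] * size
        vector[index] = 1
        table[category] = vector
    return table


animals = ["Cat", "Dog", "Cow", "Camel"]
table = one_hot_table(animals)

# Print a header row followed by one row per animal.
print("Animal", *animals, sep="\t")
for animal, vector in table.items():
    print(animal, *vector, sep="\t")
```

Each row contains exactly one 1, in the column that matches the animal named in that row.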
If we represent the above output in vector form, it looks like this.

Cat -> [1, 0, 0, 0]
Dog -> [0, 1, 0, 0]
Cow -> [0, 0, 1, 0]
Camel -> [0, 0, 0, 1]

Why use a One Hot Encoding?

One of the main advantages of one-hot encoding is that it makes the representation of categorical data more expressive. As we discussed earlier, many machine learning algorithms cannot work with categorical data directly, so the categories must be converted into numbers. We can use the integer encoding directly where it is sufficient. This can work for problems where there is a natural ordinal relationship between the categories. For example, we can assign integer values to a "weather" label with the values 'winter', 'summer' and 'Monsoon'. But there may be problems when no ordinal relationship exists. If we allow the representation to lean on any such relationship, it might damage learning to solve the problem.

Manual One Hot Encoding

In the following example, we will take a string of alphabet letters and convert it into integer values. Then we will apply one-hot encoding to that string. Let's see the following example.
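The manual approach can be sketched as follows. This is a reconstruction under assumptions: the input string 'hello python' and the 27-character alphabet (lowercase letters plus a space) are inferred from the output shown in this section, and the variable names are illustrative.

```python
# Input string to encode (inferred from the printed output).
data = "hello python"
print(data)

# Universe of possible input values: 26 lowercase letters plus a space.
alphabet = "abcdefghijklmnopqrstuvwxyz "

# Map each character to its integer index in the alphabet.
char_to_int = {char: index for index, char in enumerate(alphabet)}

# Integer encode the input string.
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)

# One-hot encode: one binary vector of length 27 per character,
# with a 1 at the index given by the integer encoding.
onehot_encoded = []
for value in integer_encoded:
    letter = [0] * len(alphabet)
    letter[value] = 1
    onehot_encoded.append(letter)
print(onehot_encoded)
```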
Output:

hello python
[7, 4, 11, 11, 14, 26, 15, 24, 19, 7, 14, 13]
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

Explanation:

In the above code, we declared the input string and printed it. Next, we defined the universe of possible input values. Then, a mapping from characters to integer values was created for all possible inputs. We used this mapping to encode the input string. As we can see in the above output, the first letter, h, is encoded as 7.

Then, this integer encoding is converted to a one-hot encoding, one integer-encoded character at a time. Each character has a specific index value, and we mark that index as 1 in the character's vector. The first character, h, is encoded as 7, so in its binary vector of length 27 we mark index 7 (counting from 0) as 1.

Now, we will learn to implement one-hot encoding using the scikit-learn library.

One Hot Encode using Scikit-learn

In this example, let's assume an output sequence made up of the following 3 labels. An example sequence of 10 time steps may be.
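A sketch of the scikit-learn approach used in this section, assuming the label sequence that appears in the printed output further below (apple, mango, banana). It calls `.toarray()` on the default sparse result instead of passing a dense-output argument, because the name of that argument differs between scikit-learn versions; that choice is an assumption of this sketch.

```python
from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label sequence taken from the printed output in this section.
data = ["apple", "apple", "mango", "apple", "banana", "banana", "mango", "apple"]
values = array(data)
print(values)

# Integer encode: labels are numbered in sorted order
# (apple=0, banana=1, mango=2).
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

# OneHotEncoder expects a 2D array with one column per feature.
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

# The default result is a sparse matrix; convert it to a dense array.
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(integer_encoded).toarray()
print(onehot_encoded)
```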
We first integer encode the above labels, for example as 0, 1, 2. In the one-hot encoding, each integer is then represented by a binary vector with 3 values, such as [1, 0, 0]. The sequence includes at least one example of each possible value.

We will use the scikit-learn library: the LabelEncoder class to create an integer encoding of the labels, and the OneHotEncoder class to create a one-hot encoding of the integer-encoded values. Let's understand the following example.

Output:

['apple' 'apple' 'mango' 'apple' 'banana' 'banana' 'mango' 'apple']
[0 0 2 0 1 1 2 0]
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]

Explanation -

In the above code, we first printed the sequence of labels. Then, we performed the integer encoding and finally the one-hot encoding. By default, the OneHotEncoder class returns a memory-efficient sparse encoding. However, a sparse result may not be suitable for some applications, such as use with the Keras library.

One Hot Encoding with Keras

Let's suppose we have a sequence that is already integer encoded. We can work with the integers directly or one-hot encode them. We can use the to_categorical() function to one-hot encode integer data. In this example, we have five integer values [0, 1, 2, 3, 4] and an input sequence of the following 15 numbers. Let's understand the following example.

Output:

[1 4 3 3 0 3 2 2 4 0 1 2 1 4 3]
[[0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0.]]
1

Explanation -

In the above code, we encoded the integers as binary vectors and printed them. Then, we used the NumPy argmax() function to invert the encoding of the first value in the sequence, which returns the original integer 1.
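The Keras example can be sketched as follows, using the 15-number sequence from the output above. The import path `tensorflow.keras.utils` is an assumption about the installed environment; if it is unavailable, the sketch falls back to a small NumPy stand-in with the same name, which is not the Keras implementation.

```python
import numpy as np

try:
    # Assumes TensorFlow's bundled Keras is installed.
    from tensorflow.keras.utils import to_categorical
except ImportError:
    def to_categorical(y, num_classes=None):
        """NumPy stand-in for illustration only (not the Keras code)."""
        y = np.asarray(y, dtype=int)
        if num_classes is None:
            num_classes = int(y.max()) + 1
        # Row i of the identity matrix is the one-hot vector for class i.
        return np.eye(num_classes)[y]

# Integer-encoded input sequence with values in the range 0..4.
data = np.array([1, 4, 3, 3, 0, 3, 2, 2, 4, 0, 1, 2, 1, 4, 3])
print(data)

# One-hot encode: each integer becomes a binary vector of length 5.
encoded = to_categorical(data)
print(encoded)

# Invert the first vector: argmax returns the index of the 1.
inverted = np.argmax(encoded[0])
print(inverted)
```

Because the largest value in the sequence is 4, the number of classes is inferred as 5 when it is not passed explicitly.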