How to Use Pandas Melt – pd.melt() for AI and Machine Learning

What is Pandas Melt?

Pandas Melt is a flexible function for reshaping Pandas data frames. It converts a data frame from a wide format to a long format, which is the layout that many data science tools expect. In a wide format, each subject appears once in the first column and every measured variable has its own column. In a long format, each subject repeats in the first column, with one row per subject and variable. In code, it is called as “pd.melt()”.

The function accepts seven parameters: frame, id_vars, value_vars, var_name, value_name, col_level, and ignore_index. The only required parameter is “frame”, the data frame that you want to reshape (it is usually passed positionally, as “df” in the examples below). id_vars names the columns to keep as identifiers. value_vars names the columns that will be melted; if it is omitted, every column not listed in id_vars is melted. var_name names the variable column in the output. value_name names the value column in the output. col_level selects which level to melt when the columns are multi-indexed. Finally, ignore_index controls whether the original index is discarded (True, the default) or retained (False).

All of these parameters can be used at once, and with their default values the call looks like “pd.melt(df, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)”. When var_name is left as None, the variable column is named 'variable' unless the column index has its own name.
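To make the defaults concrete, here is a minimal sketch. The tiny data frame and its column names are made up purely for illustration, and the ignore_index argument assumes pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical two-row frame with a non-default index, used only for illustration
df = pd.DataFrame(
    {"id": ["a", "b"], "x": [1, 2], "y": [3, 4]},
    index=[10, 20],
)

# Only the frame and id_vars are given: every other column is melted,
# and the output columns get the default names 'variable' and 'value'
default_melt = pd.melt(df, id_vars="id")
print(default_melt.columns.tolist())  # ['id', 'variable', 'value']
print(default_melt.index.tolist())    # [0, 1, 2, 3]

# ignore_index=False keeps the original index labels,
# repeated once for each melted column
kept_index = pd.melt(df, id_vars="id", ignore_index=False)
print(kept_index.index.tolist())      # [10, 20, 10, 20]
```

Keeping the original index can be handy when you later need to trace a long-format row back to the wide-format row it came from.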

Long data frame vs Wide data frame

We described long and wide data frames above, but the concept is easier to understand when you can see it. Wide data frames can have many columns, which can become difficult to manage. A long data frame keeps the number of columns small and fixed, which makes it easier to group, filter, and feed the data into machine learning pipelines. Below is an example of how a wide data frame may look:

Wide Data Frame:

Person   Age   Weight   Height
------   ---   ------   ------
Bob      32    168      180
Alice    24    150      175
Steve    64    144      165

In this example we have four columns. By using the melt function, we can transform this data efficiently into a long data frame as shown below:

Long Data Frame:

Person   Variable   Value
------   --------   -----
Bob      Age        32
Bob      Weight     168
Bob      Height     180
Alice    Age        24
Alice    Weight     150
Alice    Height     175
Steve    Age        64
Steve    Weight     144
Steve    Height     165
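The conversion between the two tables above can be sketched in a few lines. The variable names people and long_people are chosen here for illustration only.

```python
import pandas as pd

# The wide table from above: one row per person
people = pd.DataFrame(
    {
        "Person": ["Bob", "Alice", "Steve"],
        "Age": [32, 24, 64],
        "Weight": [168, 150, 144],
        "Height": [180, 175, 165],
    }
)

# Keep Person as the identifier and melt every other column
long_people = pd.melt(
    people, id_vars="Person", var_name="Variable", value_name="Value"
)

print(long_people.shape)  # (9, 3)
# Show just Bob's rows from the long table
print(long_people[long_people["Person"] == "Bob"])
```

Note that melt stacks the output one variable at a time (all Age rows, then all Weight rows, then all Height rows), so the row order differs from the table above, which is grouped by person. The content is the same.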

The columns have shrunk from four to three, while the rows have grown from three to nine: one row for each person and variable. Let us now walk through the conversion step by step in Python, using a slightly larger example. First, we need to create a wide data frame. The code to do this is shown below:

# Creating sample data
import pandas as pd

# Create a wide data frame: one row per item, one column per hour
df = pd.DataFrame(
    {
        'Item': ['Cereals', 'Dairy', 'Frozen', 'Meat'],
        'Price': [100, 50, 200, 250],
        'Hour_1': [5, 5, 3, 8],
        'Hour_2': [8, 8, 2, 1],
        'Hour_3': [7, 7, 8, 2]
    }
)
print(df)

This creates a table with the items Cereals, Dairy, Frozen, and Meat. It has five columns: Item, Price, Hour_1, Hour_2, and Hour_3. This layout is easy for humans to read but harder for a machine to process, so we need to reshape it into a long data frame. The wide data frame looks like this:

Item      Price   Hour_1   Hour_2   Hour_3
-------   -----   ------   ------   ------
Cereals   100     5        8        7
Dairy     50      5        8        7
Frozen    200     3        2        8
Meat      250     8        1        2

Now let’s use Python to reshape this data frame into a long format. We will have one column containing item, one column containing hour, and one column containing sales. Below is the code on how to do that:

# Melt the hourly columns into Hour/Sales pairs, keeping Item as the identifier
melt_df = pd.melt(
    df,
    id_vars=['Item'],
    value_vars=['Hour_1', 'Hour_2', 'Hour_3'],
    var_name='Hour',
    value_name='Sales',
    col_level=None
)
melt_df

The output of this code can be seen below:

Item      Hour     Sales
-------   ------   -----
Cereals   Hour_1   5
Dairy     Hour_1   5
Frozen    Hour_1   3
Meat      Hour_1   8
Cereals   Hour_2   8
Dairy     Hour_2   8
Frozen    Hour_2   2
Meat      Hour_2   1
Cereals   Hour_3   7
Dairy     Hour_3   7
Frozen    Hour_3   8
Meat      Hour_3   2

The data has gone from five columns to three. The Price column was dropped because it was not listed in id_vars or value_vars. The long format makes aggregation straightforward. For example, we can total the sales for each item using the “groupby” function, a Pandas function that groups rows sharing the same value in a given column. This can be done in one line of code: “melt_df.groupby('Item')['Sales'].sum()”. The output of this code is shown below:

Item      Sales
-------   -----
Cereals   20
Dairy     20
Frozen    13
Meat      11

This tells us how many units of each item were sold. We can also group by hour to see how many sales occurred in each hour. The code for this is “melt_df.groupby('Hour')['Sales'].sum()”. The output is shown below:

Hour     Sales
------   -----
Hour_1   21
Hour_2   19
Hour_3   24
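For readers who want to run both aggregations end to end, the following self-contained sketch rebuilds the sample data and the melted frame from this article. The names sales_by_item and sales_by_hour are introduced here for illustration.

```python
import pandas as pd

# Sample wide data frame from this article
df = pd.DataFrame(
    {
        "Item": ["Cereals", "Dairy", "Frozen", "Meat"],
        "Price": [100, 50, 200, 250],
        "Hour_1": [5, 5, 3, 8],
        "Hour_2": [8, 8, 2, 1],
        "Hour_3": [7, 7, 8, 2],
    }
)

# Long format: one row per item and hour
melt_df = pd.melt(
    df,
    id_vars=["Item"],
    value_vars=["Hour_1", "Hour_2", "Hour_3"],
    var_name="Hour",
    value_name="Sales",
)

# Total units sold per item and per hour
sales_by_item = melt_df.groupby("Item")["Sales"].sum()
sales_by_hour = melt_df.groupby("Hour")["Sales"].sum()

print(sales_by_item)
print(sales_by_hour)
```

Both results are Pandas Series indexed by the grouping column, so a single total can be looked up with, for example, sales_by_item['Meat'].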

As you can see, data in long form is much easier to work with. Extra columns can also be carried through the melt. Let us keep the Price column this time by adding it to id_vars. Below is the code needed to accomplish this:

# Same melt as before, but Price is kept as a second identifier column
melt_df = pd.melt(
    df,
    id_vars=['Item', 'Price'],
    value_vars=['Hour_1', 'Hour_2', 'Hour_3'],
    var_name='Hour',
    value_name='Sales',
    col_level=None
)
melt_df

With this change, our long-format data frame looks like this:

Item      Price   Hour     Sales
-------   -----   ------   -----
Cereals   100     Hour_1   5
Dairy     50      Hour_1   5
Frozen    200     Hour_1   3
Meat      250     Hour_1   8
Cereals   100     Hour_2   8
Dairy     50      Hour_2   8
Frozen    200     Hour_2   2
Meat      250     Hour_2   1
Cereals   100     Hour_3   7
Dairy     50      Hour_3   7
Frozen    200     Hour_3   8
Meat      250     Hour_3   2

As you can see, the Price column was carried into the long data frame and is repeated on every row for its item. With a price column we can calculate things like total revenue, or revenue by item or by hour. These can all be done with the groupby function, and the code is very similar to what is shown above.
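As a rough sketch of the revenue calculations mentioned above, the example below assumes that Price is the price per unit and Sales is the number of units sold; the article does not state this explicitly. The Revenue column and the result names are introduced here for illustration.

```python
import pandas as pd

# Sample wide data frame from this article
df = pd.DataFrame(
    {
        "Item": ["Cereals", "Dairy", "Frozen", "Meat"],
        "Price": [100, 50, 200, 250],
        "Hour_1": [5, 5, 3, 8],
        "Hour_2": [8, 8, 2, 1],
        "Hour_3": [7, 7, 8, 2],
    }
)

# Long format that keeps Price alongside Item
melt_df = pd.melt(
    df,
    id_vars=["Item", "Price"],
    value_vars=["Hour_1", "Hour_2", "Hour_3"],
    var_name="Hour",
    value_name="Sales",
)

# Assumption: revenue per row = unit price * units sold
melt_df["Revenue"] = melt_df["Price"] * melt_df["Sales"]

total_revenue = melt_df["Revenue"].sum()
revenue_by_item = melt_df.groupby("Item")["Revenue"].sum()
revenue_by_hour = melt_df.groupby("Hour")["Revenue"].sum()

print(total_revenue)  # 8350
print(revenue_by_item)
print(revenue_by_hour)
```

Because every long-format row already carries its own price, the row-wise multiplication needs no joins or loops over the hour columns.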

Reversing Pandas Melt 

The Pandas Melt function can also be reversed, which takes us from a long data frame back to a wide data frame. This can be done using the pivot function. The documentation for the pivot function can be found at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html. To reverse Pandas Melt, the index argument of the pivot function must match the ‘id_vars’ columns used in the melt, the columns argument must be the name of the variable column, and the values argument must be the name of the value column. The identifier columns come back as the index of the result; reset_index() turns them back into ordinary columns if needed. The code to do this can be seen below:

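# Pivot back to wide format; index must match the id_vars used in the melt
# (passing a list as index requires pandas 1.1 or later)
df_unmelted = melt_df.pivot(
    index=['Item', 'Price'], columns='Hour', values='Sales'
).reset_index()
df_unmelted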

By doing this the data frame is now back to a wide format as seen below:

Item      Price   Hour_1   Hour_2   Hour_3
-------   -----   ------   ------   ------
Cereals   100     5        8        7
Dairy     50      5        8        7
Frozen    200     3        2        8
Meat      250     8        1        2
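One way to check that the round trip really recovers the original data is to compare the two frames directly. The sketch below is self-contained and assumes pandas 1.1 or later, since pivot is given a list for index; pd.testing.assert_frame_equal is a helper that ships with pandas.

```python
import pandas as pd

# Sample wide data frame from this article
df = pd.DataFrame(
    {
        "Item": ["Cereals", "Dairy", "Frozen", "Meat"],
        "Price": [100, 50, 200, 250],
        "Hour_1": [5, 5, 3, 8],
        "Hour_2": [8, 8, 2, 1],
        "Hour_3": [7, 7, 8, 2],
    }
)

# Wide -> long
melt_df = pd.melt(
    df,
    id_vars=["Item", "Price"],
    value_vars=["Hour_1", "Hour_2", "Hour_3"],
    var_name="Hour",
    value_name="Sales",
)

# Long -> wide again
df_unmelted = melt_df.pivot(
    index=["Item", "Price"], columns="Hour", values="Sales"
).reset_index()

# pivot leaves the label 'Hour' on the column index; clear it
# so the result has the same column metadata as the original
df_unmelted.columns.name = None

# Raises an AssertionError if the round trip changed anything
pd.testing.assert_frame_equal(df_unmelted, df)
```

Note that pivot sorts the rows by the index columns. The sample items happen to be in alphabetical order already, so the row order is unchanged here; with other data you may need to sort both frames before comparing.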

Conclusion

I hope this article has shown you the importance of Pandas Melt in the context of data science. Converting a wide data frame into a long one makes it easier to prepare data for machine learning algorithms. Thank you for reading this article.

The pd.melt() function in Pandas is a valuable tool for data preprocessing, a crucial step in any AI or machine learning workflow. By transforming a dataset from a wide format to a long format, pd.melt() simplifies grouping and aggregation and puts the data in the one-observation-per-row layout that many analysis and modeling tools expect. Consistently shaped input can, in turn, make model training and evaluation less error-prone.

pd.melt() is also flexible: data scientists can specify which columns to keep unchanged and which to unpivot. This control makes it possible to tailor the transformation to the specific requirements of each AI or machine learning project, which is why pd.melt() is worth having in any data scientist’s toolkit.
