Music Genre Classification using Deep Learning
Everyone enjoys good music, be it a song that calls you to the dance floor or one that takes you down memory lane. The music industry has long tried to make music that appeals to a wider audience; this is a daunting task, as a few types of songs cannot meet everyone's needs. This is where artists come in and create a new version of a song by:
- Removing some instruments from the original music
- Adding new instruments to the original music
- Adding a specific style to the original music, etc.
The list of possible combinations is too long to enumerate. This opens up a huge market in the music industry: cover songs are made for almost all popular songs, and many variations of a single song are possible. Huge investments are made in artists and singers to produce new cover songs.
If a cover song is made with a particular audience in mind, it may well fail to gain traction with other groups. Is it possible to make different cover songs for different audiences without burning a hole in the production house's pocket? Imagine an application where you upload your favourite song, select your favourite singer/artist, and the application plays back a version of the song encapsulating that singer's/artist's style. Artificial Intelligence might be the answer: we have heard of style transfer for images, so can we apply the concept of style transfer to music?
To achieve this, we treat the problem as twofold. In the first part we identify the genre of a song using classification techniques; in the second part we delve deeper into generating music in a different style without losing the character of the original.
PART 1: Music Classifier
Sound waves are composed of compressions in the air followed by rarefactions; a series of compression-rarefaction cycles is perceived as sound by our ears. Music is no different in its building blocks. But how do we make a computer understand compression and rarefaction? For a computer to apply any Machine Learning or Deep Learning algorithm, the data has to be a set of arrays (or tensors). Converting a song into a set of numbers can be treated as feature extraction, but what features are we extracting? For our task we extracted 39 MFCCs and 5 aggregate features:
- Zero Crossing rate
- Spectral centroid
- Spectral roll-off: a measure of the shape of the signal. It represents the frequency below which a specified percentage of the total spectral energy, e.g. 85%, lies.
- Chroma Frequencies
- MFCC
About the Dataset
We used the GTZAN dataset to train our models. This dataset was used in the well-known genre classification paper "Musical genre classification of audio signals" by G. Tzanetakis and P. Cook, IEEE Transactions on Speech and Audio Processing, 2002.
The genres are metal, disco, classical, hiphop, jazz, country, pop, blues, reggae, rock.
· The dataset consists of 100 songs per genre across 10 genres, adding up to 1000 songs. Each song is a 30-second, mono-channel audio clip in the .au format.
· A song is, in general, a waveform of varying amplitude over time. We sample the song to discretize it, and the audio features are extracted from the discretized signal (see the loading sketch below).
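As a quick illustration, a clip can be loaded and discretized with librosa; the file path below is a hypothetical GTZAN-style name, and the 22050 Hz rate is simply librosa's default.

```python
import librosa

# librosa resamples to 22050 Hz by default and returns a mono float array;
# the path is an illustrative GTZAN-style filename
signal, sr = librosa.load("genres/blues/blues.00000.au", duration=30.0)
print(signal.shape)  # roughly (661500,): 30 s x 22050 samples/s
```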
We chose per-class recall as our metric: the problem at hand has 10 classes, and we want the model to learn all of them equally well.
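For reference, per-class recall can be computed with scikit-learn; the toy labels below are purely illustrative.

```python
from sklearn.metrics import recall_score

# toy integer labels for three genres, just to show the call;
# average=None returns one recall value per class
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(recall_score(y_true, y_pred, average=None))  # [0.5, 1.0, 0.5]
```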
Prior Work in this area
As part of our research we came across a lot of prior work in this area; several model architectures have been tried for classifying the songs in the GTZAN dataset.
Even with such state-of-the-art models, per-class recall was not stable across the genres, and the models struggled to tell some genres apart. Our approach aimed at using simple models and rich features.
Implementation Details
Feature extraction:
We used Python and librosa to extract the features mentioned above. After trying a few Machine Learning and Deep Learning models on the extracted zero crossing rate, spectral centroid, spectral roll-off, and chroma frequencies along with the 39 MFCC features, we concluded that these features alone do not carry enough information for a simple model to learn from.
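A condensed sketch of this extraction step is shown below. It assumes each per-frame feature is averaged into a single aggregate value; the exact aggregation in our pipeline may differ slightly.

```python
import numpy as np
import librosa

def extract_features(path):
    y, sr = librosa.load(path, duration=30.0)
    zcr = librosa.feature.zero_crossing_rate(y=y).mean()
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85).mean()
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean()
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=39).mean(axis=1)  # 39 values
    # concatenate the aggregates with the 39 MFCC means into one feature vector
    return np.hstack([zcr, centroid, rolloff, chroma, mfccs])
```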
Data Augmentation:
Before trying more complex models, we wanted to increase the size of the dataset using data augmentation. But how do we increase the size of a dataset that is comprised of songs?
After some research, we came across the following techniques (sketched after the list):
- Adding noise
- Temporal shift
- Temporal stretch and squeeze
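A rough sketch of these three augmentations, assuming librosa and NumPy; the noise level, shift amount, and stretch rate are illustrative choices, not the exact values we used.

```python
import numpy as np
import librosa

def add_noise(y, noise_level=0.005):
    return y + noise_level * np.random.randn(len(y))  # additive white noise

def temporal_shift(y, sr, seconds=0.5):
    return np.roll(y, int(sr * seconds))  # circular shift along the time axis

def stretch_or_squeeze(y, rate=1.1):
    # rate > 1 squeezes (speeds up), rate < 1 stretches (slows down)
    return librosa.effects.time_stretch(y, rate=rate)
```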
Using the above-mentioned techniques we increased the dataset size from 1000 songs (1.3 GB of disk space) to 7020 songs (30 GB of disk space). We used scikit-learn's StandardScaler to scale the NumPy arrays extracted from the above process.
These data augmentation techniques helped us tackle the variance problem, and the feature extraction techniques helped us tackle the bias problem.
Train data size: (5608, 44) – 5608 songs, 44 features
Test data size: (1402, 44) – 1402 songs, 44 features
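For clarity, the scaling and split could look like the sketch below; the placeholder data and the 80/20 split are assumptions that merely reproduce the shapes above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.randn(7010, 44)            # placeholder for the real feature matrix
y = np.random.randint(0, 10, size=7010)  # placeholder genre labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler().fit(X_train)   # fit the scaler on training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
print(X_train.shape, X_test.shape)       # (5608, 44) (1402, 44)
```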
Architecture of the models
We used some basic Deep Learning models for our classification task:
- Multi Layer perceptron
- Recurrent Neural Network
- Long Short Term Memory
- Bidirectional Long Short Term Memory
- A Neural Network Ensemble of the above models.
- We built multiple architectures with varying levels of complexity in terms of depth, layer types, and neuron counts.
- We used the Adam optimizer for optimization.
- We used dropout for regularization.
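As an example, a minimal Keras MLP of this flavour might look as follows; the layer widths and dropout rate are illustrative, not our exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(44,)),               # the 44 extracted features
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),                     # dropout for regularization
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),  # one output per genre
])
model.compile(optimizer="adam",              # Adam optimizer
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```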
Evaluation
- Most of the deep learning models we built performed well on the training and test data, with per-class recall in the range of 95–98%.
- We then had to choose one model from these; we evaluated them on new data and found that the neural network performed most consistently.
- For the above-mentioned reasons, we chose the neural network as the model in our deployment.

For a detailed comparison of the train and validation data performance please refer to this link.
A Docker application for the above can be obtained here. This is a Django application with a basic front end; the deep learning model is used to classify songs in this application.
Stay tuned for part 2 of the blog, where we delve into the style transfer part.