A Multi-Modal approach towards

Music Emotion Recognition

Using Audio, Lyrics and Electrodermal Activity

Image by Gordon Johnson from Pixabay

Music Emotion Recognition (MER) has been an active research topic for decades. This article proposes a multi-modal approach to Music Emotion Recognition that uses three feature sets:

Audio features extracted from the song,
Lyrics of the song, and
Electrodermal Activity (EDA) of humans while listening to the song.

Audio and lyrical features of a song strongly complement each other. While audio features like rhythm and pitch capture a song's mood and genre, there remains a semantic gap between these features and the music itself. Lyrics capture the meaning and specificity of the language, filling in that gap. Electrodermal Activity (EDA) reflects the intensity of a person's emotional state, so measuring it gives a direct sense of the emotions a song evokes in listeners. This work focuses on jointly using these three aspects to closely model emotions in music.

The Dimensional Model of Emotion

The traditional emotion recognition methodology categorizes emotions into discrete classes such as happy, sad, pleasant, aggressive, calm, or excited. This discrete classification fails to capture the complexity of human cognition and the vast range of emotions humans experience over time. To record emotions more precisely, Russell's Circumplex Model of Affect places them in a 2D space, as depicted in the figure below. The two variables on the X and Y axes, Valence (the level of pleasantness) and Arousal (the level of activeness), can record the emotions in a piece of music. Henceforth, we follow this mapping, which gives two target variables for the regression problem: Arousal and Valence.

Russell's 2D Emotional Space
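
To make the two-dimensional mapping concrete, here is a minimal sketch that assigns a (valence, arousal) pair to one of the four quadrants of the circumplex. It assumes both values are scaled to [0, 1] with 0.5 as the neutral midpoint. The function name `emotion_quadrant` and the quadrant labels are illustrative choices, not part of the dataset or of this project's code.

```python
def emotion_quadrant(valence: float, arousal: float, midpoint: float = 0.5) -> str:
    """Map a (valence, arousal) point to a coarse quadrant label.

    Assumes both values lie in [0, 1], so `midpoint` marks the neutral centre.
    The label strings are illustrative only.
    """
    high_valence = valence >= midpoint
    high_arousal = arousal >= midpoint
    if high_valence and high_arousal:
        return "excited / happy"
    if not high_valence and high_arousal:
        return "tense / angry"
    if not high_valence and not high_arousal:
        return "sad / depressed"
    return "calm / relaxed"


print(emotion_quadrant(0.8, 0.7))
print(emotion_quadrant(0.2, 0.3))
```

The regression models in this article predict the two continuous values directly; a quadrant label like this is only a convenient way to read a prediction.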

♬ Data Preprocessing and Feature Extraction

For this project, the PMEmo Dataset is used. It contains the emotion annotations of 794 songs and the simultaneous electrodermal activity (EDA) signals from 457 subjects. The dataset also provides lyrics of each song.

Static and dynamic features are available for both the audio and the EDA data. Static features provide one data point per song. Dynamic features, in contrast, provide multiple data points per song, annotated at intervals of 0.5 seconds.

Variance in the Dataset for 1) STATIC and 2) DYNAMIC Annotations

Audio Data

The dataset provides 6373 static features and 260 dynamic features. Some of the audio-based features include rhythm, tempo and other audio signal specifications. The dataset was normalized before training.

Electrodermal Activity (EDA) Data

The observed EDA data of at least 10 people per song is provided at a sampling frequency of 50 Hz. Of the 794 songs, true labels are available for 767 in both static and dynamic modes. The final data was created by combining the data for each song using musicID and TimeFrame, and was then used to extract relevant features. Hence, the final EDA data consists of these 767 data points. The dynamic feature set was downsampled from 50 Hz to 2 Hz to match the frequency of the true labels. Both datasets were then normalized.

Representation of EDA data for one particular song for 10 different people

Some of the extracted features include Time Domain SCR (Skin Conductance Response), Time Domain statistical features, Frequency Domain statistical and band-power features.
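
The downsampling step from 50 Hz to 2 Hz can be sketched as follows. The article does not state which method was used, so averaging each block of 25 consecutive samples is an assumption made here for illustration, and `downsample_by_mean` is a hypothetical helper name.

```python
from statistics import mean


def downsample_by_mean(signal, source_hz=50, target_hz=2):
    """Average consecutive blocks so source_hz samples/s become target_hz samples/s."""
    if source_hz % target_hz != 0:
        raise ValueError("source_hz must be a multiple of target_hz")
    block = source_hz // target_hz              # 25 samples per 0.5 s frame
    usable = len(signal) - len(signal) % block  # drop a trailing partial block
    return [mean(signal[i:i + block]) for i in range(0, usable, block)]


# Two seconds of a synthetic 50 Hz signal (100 samples), not real EDA data.
eda = [float(i) for i in range(100)]
frames = downsample_by_mean(eda)
print(len(frames))  # 4 frames, one per 0.5 s
print(frames)       # [12.0, 37.0, 62.0, 87.0]
```

Each output frame then lines up with one 0.5-second annotation of the dynamic labels.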

Lyrical Data

The dataset provides lyrics for 629 songs, six of which are in languages other than English. Of the remaining 623 songs, ground truth labels are available for 603. Hence, the lyrics corpus comprises these 603 songs, each in the format given below.

Format of the Lyrics provided in the PMEmo Dataset

After basic preprocessing (removal of punctuation and stopwords, lemmatization, and tokenization), two kinds of features were extracted from this lyrics corpus:

Linguistic Features
The following 22 features describe the lyrics of a song semantically. These were extracted using various tools and libraries like jLyrics, SenticNet API, NLTK and Gensim, as cited at the end of this article.

TF-IDF Vectorized Features
TF-IDF is a frequency-based method for vectorizing a text corpus. It differs from plain count-based techniques like Bag of Words or CountVectorizer in that it penalizes common words by assigning them lower weights. Importance is given to words that rarely occur across the corpus but appear frequently in a few songs. Such a vectorization technique is more useful for lyrics, since a single rare word can convey stronger emotion than multiple occurrences of common words (for example, in the chorus).

Through TF-IDF vectorization, a vocabulary of 10,384 words was developed. Using Principal Component Analysis (PCA), the dimensionality was reduced to 576 features while retaining 99% of the variance.
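
The weighting idea behind TF-IDF can be shown with a small hand-rolled sketch. It uses the textbook form tf × log(N / df) on a made-up three-line corpus. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their exact values differ, and this is not the code used in the project.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(corpus)
    tokenized = [doc.lower().split() for doc in corpus]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up lyric snippets, for illustration only.
lyrics = [
    "love love love tonight",
    "love hurts tonight",
    "love and sorrow",
]
scores = tfidf(lyrics)
print(scores[0]["love"])    # 0.0: "love" appears in every document
print(scores[2]["sorrow"])  # positive: "sorrow" appears in only one document
```

Even though "love" is repeated three times in the first snippet, it receives zero weight because it occurs in every document, while the single occurrence of "sorrow" is weighted highly.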

The final dataset distribution is as follows:

Final Data Distribution across the three modalities

♬ Uni-modal Training and Results

The following regression models were trained on each feature set, for Arousal as well as Valence:

• Lasso Regressor
• ElasticNet Regressor
• kNN Regressor
• Decision Tree Regressor
• Random Forest Regressor
• AdaBoost Regressor
• GradientBoost Regressor
• Support Vector Regressors

Training and Cross-Validation

For each feature set, 90% of the data points were used to train the models with 10-fold cross-validation for the static dataset, and 5-fold cross-validation for the dynamic dataset. The remaining 10% of data was used for testing the trained models.


Hyperparameter tuning was done using GridSearchCV. Hyperparameters such as n_estimators, alpha, max_iter, and max_depth were tuned to obtain each model's optimal results.
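
The split and tuning procedure described above might look roughly like this in scikit-learn. This is a sketch on synthetic data: the parameter grid values, the choice of RandomForest as the example model, and the random seeds are assumptions for illustration, not the settings used in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for one static feature set and one target (e.g. Arousal).
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Hold out 10% for testing; the remaining 90% is used for cross-validated tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Illustrative grid only; the article does not list the values that were searched.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=10,                                  # 10-fold CV, as for the static dataset
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
)
search.fit(X_train, y_train)

# score() returns the negated RMSE of the refitted best model on the held-out data.
test_rmse = -search.score(X_test, y_test)
print(search.best_params_)
print(test_rmse)
```

For the dynamic dataset, the same pattern would apply with `cv=5`.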

Scoring Metric

Each model was evaluated using three scoring metrics, namely Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Median Absolute Error (MedAE).
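
For reference, the three metrics can be written out in a few lines of plain Python. The sample values below are made up to show how the metrics differ, and the function names are illustrative.

```python
import math
from statistics import mean, median


def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large errors more heavily."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def medae(y_true, y_pred):
    """Median Absolute Error: robust to a few large errors."""
    return median([abs(t - p) for t, p in zip(y_true, y_pred)])


# Made-up annotations and predictions, for illustration only.
y_true = [0.2, 0.5, 0.9, 0.4]
y_pred = [0.3, 0.5, 0.7, 0.8]

print(rmse(y_true, y_pred))
print(mae(y_true, y_pred))
print(medae(y_true, y_pred))
```

On this sample, the single large error (0.4) pulls RMSE above MAE, while MedAE is the least affected by it.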


The models listed above were compared on their RMSE scores, the metric judged most suitable of the three. The following graphs summarize the testing RMSE score for each model in static as well as dynamic mode.

RMSE Scores for Regressor Models tested on Static Feature Sets for 1) Arousal and 2) Valence
RMSE Scores for Regressor Models tested on Dynamic Feature Sets for 1) Arousal and 2) Valence

♪ Observations and Analysis

Comparing the results shown above, the following inferences can be drawn:

  • In general, bagging and boosting models like RandomForest, AdaBoost and GradientBoost perform much better than the rest of the models. A likely reason is that the song features have high variance, which such ensemble models help reduce without a large increase in bias.
  • Lyrical features from both TF-IDF Vectorization and the 22 Linguistic Features gave similar results for all the models, except for kNN Regressor and Polynomial SVR.
  • Overall, models trained on audio features resulted in better scores for both static and dynamic datasets (compared to EDA and Lyrical features).
  • The following table shows the best models obtained as per RMSE scoring metric, for each feature set.
Best Models as per RMSE Scoring Metric

♬ Multi-modal Training and Results

On identifying the models that best fit each feature set, the audio, lyrics, and EDA features were combined (static and dynamic separately). These merged datasets were trained using two techniques, namely Stacking and Voting Regression, as detailed below.

Modelling of 1) Stacking Regressor and 2) Voting Regressor

Stacking Regression works by using multiple models to predict results on the same data and then feeding these predictions as inputs to an added final-layer estimator.

Voting Regression works similarly: multiple models predict results on the same data, and their predictions are averaged to return the final set of predictions.

For this work, the merged datasets were supplied to two regressors, RandomForest and AdaBoost. These were chosen because they were the best individual models, as seen earlier in the article. In the case of the Stacking Regressor, a standard Linear Regressor was used as the final-layer estimator.
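
The two ensemble set-ups described above could be wired up along these lines with scikit-learn's `StackingRegressor` and `VotingRegressor`. This is a sketch on synthetic data: the estimator settings and random seeds are assumptions, and the merged audio, lyrics and EDA features are replaced by a generated matrix.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    RandomForestRegressor,
    StackingRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a merged feature matrix and one target.
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)


def base_estimators():
    """Fresh copies of the two base regressors (illustrative settings)."""
    return [
        ("rf", RandomForestRegressor(n_estimators=30, random_state=0)),
        ("ada", AdaBoostRegressor(n_estimators=30, random_state=0)),
    ]


# Stacking: base predictions become inputs to a final linear regressor.
stacking = StackingRegressor(
    estimators=base_estimators(), final_estimator=LinearRegression()
)
# Voting: base predictions are simply averaged.
voting = VotingRegressor(estimators=base_estimators())

rmse = {}
for name, model in [("stacking", stacking), ("voting", voting)]:
    model.fit(X_train, y_train)
    rmse[name] = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

print(rmse)
```

The only structural difference between the two is the final step: a learned linear combination for stacking versus a plain average for voting.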

As with the uni-modal estimators, a 90–10 split was used for training and testing data. The model hyperparameters were optimized using GridSearchCV, with RMSE as the scoring metric.

♪ Results

The following are the RMSE scores obtained on Arousal and Valence (in static and dynamic modes) using Stacking Regressor and Voting Regressor.

♪ Analysis and Observations

By comparing the results of Stacking Regressor and Voting Regressor, it is seen that Stacking Regressor gave a better RMSE score for Arousal and Valence for static as well as dynamic merged datasets.

RMSE Test Scores for Ensemble Estimators

To further analyze the Stacking model, the testing data predictions were plotted against their ground truth labels, as shown below.

Actual vs Predicted Values for Arousal and Valence using Static Features.
Actual vs Predicted Values for Arousal and Valence using Dynamic Features.

The distribution observed in the scatterplots suggests that the model trained on the dynamic dataset gave better results than the static model for the Arousal dimension. This indicates a better correlation of the Arousal dimension with the temporal variation in the songs. The same was confirmed by the residual plots for both static and dynamic datasets, as shown below.

Residual Plots for 1) Static Ensemble Model and 2) Dynamic Ensemble Model Predictions over Test Data

The residuals observed in the dynamic ensemble model are generally smaller than those of the static ensemble model. An intuitive explanation is that the dynamic dataset contains a larger number of data points, so the model can fit the target values more closely. Another observation is that residuals in the Arousal dimension are smaller than those in the Valence dimension, which follows from the explanation given with the scatterplots above.

♬ Conclusion

Summarizing the overall results, we observe that the static and dynamic ensemble models work considerably better than all the individual ones trained for lyrics, audio and EDA separately. This fulfils the project's aim by suggesting a unique model that jointly uses three different aspects of music to recognize the emotions more closely than before.

Link to the GitHub repository — Click Here

♬ Future Work and Scope

  • The current dynamic model omits the lyrics of the songs for training. Techniques for extracting time-frame-based lyrical features could be identified to include lyrics in the dynamic model.
  • The current model uses audio and EDA features from the song’s chorus, while the lyrical tokens capture the entire song. This gap might induce subtle inconsistencies. A more generalized model could be developed in the future.
  • Once the baseline model meets the desired expectations, a recommendation system could be built using similarity-prediction of songs based on their emotions.

♬ Acknowledgements

This work has been achieved through an equal contribution from Dipanshu Aggarwal, Aditya Garg and myself (Khushali Verma).

We present our heartfelt thanks to our advisor, Dr Jainendra Shukla, for providing us with valuable insights, feedback and mentorship throughout this project. A special thanks is due to our teaching assistants, Ashwini B. and Anmol Singhal, who constantly guided us in the journey.

♬ References

[1] jLyrics. URL: http://jmir.sourceforge.net/jLyrics.html

[2] Erik Cambria et al. “SenticNet 6: Ensemble Application of Symbolic and Subsymbolic AI for Sentiment Analysis”. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management. CIKM ’20. Virtual Event, Ireland: Association for Computing Machinery, 2020, pp. 105–114. ISBN: 9781450368599. DOI: 10.1145/3340531.3412003.

[3] Patrik N. Juslin and Petri Laukka. “Improving Emotional Communication in Music Performance through Cognitive Feedback”. In: Musicae Scientiae 4.2 (2000), pp. 151–183. DOI: 10.1177/102986490000400202.

[4] Ricardo Malheiro et al. “Bi-Modal Music Emotion Recognition: Novel Lyrical Features and Dataset”. In: Sept. 2016.

[5] Ricardo Malheiro et al. “Music Emotion Recognition from Lyrics: A Comparative Study”. In: Sept. 2013.

[6] Rada Mihalcea and Carlo Strapparava. “Lyrics, Music, and Emotions”. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. EMNLP-CoNLL ’12. Jeju Island, Korea: Association for Computational Linguistics, 2012, pp. 590–599.

[7] James Russell. “A Circumplex Model of Affect”. In: Journal of Personality and Social Psychology 39 (Dec. 1980), pp. 1161–1178. DOI: 10.1037/h0077714.

[8] Björn Schuller et al. “The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism”. In: INTERSPEECH 2013: 14th Annual Conference of the International Speech Communication Association. Lyon, France, Aug. 2013.

[9] Jainendra Shukla et al. “Feature Extraction and Selection for Emotion Recognition from Electrodermal Activity”. In: IEEE Transactions on Affective Computing PP (Feb. 2019), pp. 1–1. DOI: 10.1109/TAFFC.2019.2901673.

[10] Source code for nltk.sentiment.vader. URL: https://www.nltk.org/_modules/nltk/sentiment/vader.html.

[11] Yi-Hsuan Yang and Homer H. Chen. “Machine Recognition of Music Emotion: A Review”. In: ACM Trans. Intell. Syst. Technol. 3.3 (May 2012). ISSN: 2157-6904. DOI: 10.1145/2168752.2168754.

[12] Kejun Zhang et al. “The PMEmo Dataset for Music Emotion Recognition”. In: Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. ICMR ’18. Yokohama, Japan: Association for Computing Machinery, 2018, pp. 135–142. ISBN: 9781450350464. DOI: 10.1145/3206025.3206037.
