A Multi-Modal Emotion Recognition Framework Through The Fusion Of Speech With Visible And Infrared Images