ESXi itself does not virtualise the host's audio card, so you need to use some kind of remote access (such as RDP for Windows) to get any sound.
You don't say how much your virtual machines are running, but as a rule of thumb I usually say that ESX itself needs about 3 GB to work well, and then you need to add whatever memory you want to dedicate to your VMs. From that point of view, 4 GB sounds like very little for 3 VMs.
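To put rough numbers on that rule of thumb, here is a small sketch in Python. The 3 GB host figure is just my estimate from above, not an official VMware number, and the per-VM sizes are made up for illustration since you did not give yours. The function names are hypothetical too.

```python
# Rule-of-thumb memory check for an ESX host.
# Assumption: ~3 GB for the hypervisor itself (my own estimate, not a VMware figure).
HOST_OVERHEAD_GB = 3


def required_host_memory_gb(vm_memory_gb, host_overhead_gb=HOST_OVERHEAD_GB):
    """Host overhead plus the memory dedicated to each VM."""
    return host_overhead_gb + sum(vm_memory_gb)


def memory_shortfall_gb(installed_gb, vm_memory_gb, host_overhead_gb=HOST_OVERHEAD_GB):
    """How many GB are missing; 0 if the host has enough."""
    needed = required_host_memory_gb(vm_memory_gb, host_overhead_gb)
    return max(0, needed - installed_gb)


if __name__ == "__main__":
    # Hypothetical allocations for 3 VMs; replace with your real values.
    vms = [1, 1, 2]
    print("Needed (GB):", required_host_memory_gb(vms))
    print("Short by (GB):", memory_shortfall_gb(4, vms))
```

Even with small VMs like these, a 4 GB host comes up short, which is why I suspect memory is part of the problem.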
I can't say for sure what your bottleneck is, but I would guess a combination of memory and storage.
// Linjo