In this project we created a pipeline for acoustic scene classification of speech signals into the categories 'Indoors' and 'Outdoors'. Inspired by different participants in the DCASE challenge, we used an image-classification neural network together with suitable visualizations of the signal. Using the short-time Fourier transform (STFT), we created a time-frequency image of each signal. Then, using harmonic/percussive source separation (HPSS) and conversion to the mel scale, we produced different visualizations of the data. We used the products of these operations, as well as combinations of different visualizations stacked into RGB images, as inputs to the GoogLeNet CNN. Fine-tuning the network's learning parameters and the signal visualization parameters resulted in good classification performance: the system reaches 90% classification accuracy on the test set. We believe that finer hyper-parameter tuning and larger training datasets may improve the classification results even further.
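The feature-extraction stage described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses only NumPy/SciPy, approximates HPSS with the median-filtering method of Fitzgerald (2010), and, for brevity, stacks the raw, harmonic, and percussive spectrograms into the three RGB channels without the mel-scale conversion mentioned in the text. All parameter values (FFT size, hop, filter lengths) are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import median_filter

def scene_image(y, sr, n_fft=1024, hop=512, kernel=17):
    """Turn a 1-D audio signal into a 3-channel time-frequency image.

    Channels: full STFT magnitude, harmonic component, percussive component.
    Hypothetical sketch; parameters are illustrative, not from the paper.
    """
    # Time-frequency image via the STFT
    _, _, Z = stft(y, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    S = np.abs(Z)  # magnitude spectrogram, shape (freq_bins, frames)

    # Median-filtering HPSS: smoothing along time keeps harmonic ridges,
    # smoothing along frequency keeps percussive columns
    H = median_filter(S, size=(1, kernel))  # harmonic estimate
    P = median_filter(S, size=(kernel, 1))  # percussive estimate

    def norm(x):
        # Log-compress and scale each channel to [0, 1] for image input
        x = np.log1p(x)
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    # Stack the three visualizations into one RGB-like image
    return np.stack([norm(S), norm(H), norm(P)], axis=-1)
```

The resulting array can be resized to the 224x224x3 input expected by GoogLeNet before being fed to the network.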