I’ve made a small project every day in Python for the past 30 days (minus some vacation days). I’ve learned many new packages and made a wide variety of projects, including games, computer tools, machine learning, and some science. It was a lot of fun.
Day 30: Saving the Whales
For my last project, I thought I would revisit the Whale Detection Competition on Kaggle that I competed in last year. The goal was to train a model that could detect the presence of a whale call in an audio file. The data was provided by Cornell and consisted of 2-second sound recordings from buoys out in Massachusetts Bay, each containing either background ocean noise or a Right whale’s “up call,” which starts low and rises up. The Right whale is endangered (only 400 left) and doesn’t call out very often, so it can be harder to detect than, say, a Humpback whale. Better detection algorithms will help save the whales from being hit by shipping traffic.
I did pretty well in the competition last year, scoring an Area Under the Curve (AUC) of 0.95, where a perfect score would be 1.0. I used the deep learning models that I learned in Geoffrey Hinton’s Coursera course on Neural Networks to build the models, but I did all of the work in R, which brings me to the present project.
My main tool over the years for data analysis has been Matlab at school and then at work, but last year, I learned R as an open source alternative. I took the deep belief net code that I learned from the Hinton course and retooled it from Matlab to R. I added evaluation features and hyper-parameters for controlling various learning rates and just in general kept developing that code to work on further projects.
But the R code was clunky and messy. It got harder and harder to add new features, each building on previous functions. Additionally, a lot of the algorithms I wanted to try out and learn were in Matlab or Python (for instance, the winning solution to the whale detection challenge). This was one of the big motivations behind learning Python. So, in order to fully transition from R to Python, I thought I would take the time to rework the whale detection code in Python and learn about the various data tools in the process.
Fair warning: I didn’t finish the deep learning portion of the project, but I walk through what I did complete and show how a simple model fares with the feature set that I constructed.
Pre-processing and Feature Extraction:
The data is a zipped-up folder of .aiff files, so the first thing that’s necessary is to build a program to read in the files and extract whatever features are needed for the model. In R there was no direct way to read in a .aiff file, so I had to run the sox tool in a .bat file to convert the files into .wav files. To my delight, not only is there an easy way to read .aiff files in Python, but it is part of the standard modules: batteries included, so to speak.
With the file ingested, I converted it to a numpy array and then used matplotlib to plot a spectrogram of the audio file. A spectrogram is a way of examining the spectrum of a signal as it evolves over time. Specifically, it takes short chunks of time, computes the FFT of each, and then plots these snapshots of the spectrum on the y axis versus time on the x axis (the amplitude of the spectrum is the intensity in the image).
Here’s how to do that in code:
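    import os
    import aifc  # standard library through Python 3.12
    import numpy as np
    import matplotlib.pyplot as plt

    # file_names, file_name_to_labels, data_loc, train_folder, N_plot and the
    # page index j are assumed to be defined elsewhere in the script
    plt.figure(figsize=(18., 12.))
    for i, file_name in enumerate(file_names[j*N_plot:(j+1)*N_plot]):
        f = aifc.open(os.path.join(data_loc, train_folder, file_name), 'r')
        str_frames = f.readframes(f.getnframes())
        Fs = f.getframerate()
        # AIFF stores big-endian samples; read them as 16-bit integers
        time_data = np.frombuffer(str_frames, dtype='>i2')
        f.close()
        # spectrogram of file
        plt.subplot(N_plot // 4, 4, i + 1)
        Pxx, freqs, bins, im = plt.specgram(time_data, Fs=Fs, noverlap=90, cmap=plt.cm.gist_heat)
        plt.title(file_name + ' ' + file_name_to_labels[file_name])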
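For a closer look at the "short chunks, FFT, stack the snapshots" idea, here is a minimal sketch that builds a spectrogram by hand with numpy. It is not what specgram does internally in every detail; the function name simple_spectrogram, the Hann window, the chunk sizes, and the synthetic rising tone are all assumptions made for illustration.

```python
import numpy as np

def simple_spectrogram(signal, fs, nfft=256, noverlap=128):
    """Return (power, freqs, times) from a short-time FFT of a 1-D signal."""
    step = nfft - noverlap
    window = np.hanning(nfft)
    n_chunks = 1 + (len(signal) - nfft) // step
    # Each column holds the power spectrum of one windowed chunk of time.
    power = np.empty((nfft // 2 + 1, n_chunks))
    for k in range(n_chunks):
        chunk = signal[k * step:k * step + nfft] * window
        power[:, k] = np.abs(np.fft.rfft(chunk)) ** 2
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    times = (np.arange(n_chunks) * step + nfft / 2) / fs
    return power, freqs, times

# Synthetic 2-second rising tone (a made-up stand-in for an up call):
# the instantaneous frequency sweeps from 100 Hz to 300 Hz.
fs = 2000
t = np.arange(int(2.0 * fs)) / fs
sweep = np.sin(2 * np.pi * (100 * t + 50 * t ** 2))

power, freqs, times = simple_spectrogram(sweep, fs)
peak_freqs = freqs[np.argmax(power, axis=0)]  # loudest frequency in each chunk
```

Plotting power with plt.pcolormesh(times, freqs, power) would show the same kind of rising streak that a whale call leaves in the tiled images.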
Instead of just looking at one file, which might not be a great example and would only show either a whale call or not a whale call, I used matplotlib to tile multiple images to get a better sense of the data. This was way easier to do than in my experiences with R, and controlling the figure size was easier than in Matlab.
Here’s what some of those look like:
To make this usable as input to the model, I needed the raw data that went into the image that matplotlib created, which was readily available in the data returned by the specgram function. As is, this would yield almost 3000 features per audio file, which is too much for my computer to handle (there are 30,000 audio files). So I used the frequency vector and bin vector (time) to eliminate the lowest and highest frequencies as well as the beginning and end of each clip. The result was reduced to 600 features per clip, which is more manageable.
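The cropping step can be sketched with boolean masks built from the frequency and time vectors. The helper name crop_spectrogram, the array sizes, and every cutoff value below are hypothetical; the post does not say which limits were actually used.

```python
import numpy as np

def crop_spectrogram(Pxx, freqs, bins, f_min, f_max, t_min, t_max):
    """Keep only the rows (frequencies) and columns (times) inside the limits."""
    keep_f = (freqs >= f_min) & (freqs <= f_max)
    keep_t = (bins >= t_min) & (bins <= t_max)
    return Pxx[keep_f][:, keep_t]

# Stand-in arrays shaped like specgram output: 129 frequencies by 23 time bins.
freqs = np.linspace(0.0, 1000.0, 129)
bins = np.linspace(0.064, 1.936, 23)
Pxx = np.random.rand(freqs.size, bins.size)

# Made-up limits: drop the lowest and highest frequencies and the clip edges.
cropped = crop_spectrogram(Pxx, freqs, bins,
                           f_min=50.0, f_max=400.0, t_min=0.2, t_max=1.8)
features = cropped.ravel()  # flatten to one feature vector per clip
```

Because the masks come from freqs and bins rather than from hard-coded indices, the same limits keep working if the FFT size or overlap changes.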
I turned the plotting routine into a function, wrapped that in a list comprehension that looped over each file in the list of files, and finally constructed a numpy array out of the resulting list. I used cPickle to save this to disk so I wouldn’t need to repeat it. This portion of the project took me a while to do since I had never done any of these operations before and my original whales project was quite a while ago.
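The function, list comprehension, and save-to-disk pattern looks roughly like this. The extract_features function and the random clips are hypothetical stand-ins for the real per-file routine and audio, and the sketch uses pickle, which is where cPickle's functionality lives in Python 3.

```python
import os
import pickle
import tempfile
import numpy as np

def extract_features(clip):
    """Hypothetical stand-in for the per-file spectrogram feature routine."""
    return np.abs(np.fft.rfft(clip))[:10]

# Stand-in for the list of audio files: three short random clips.
rng = np.random.default_rng(0)
clips = [rng.standard_normal(64) for _ in range(3)]

# One row of features per clip, built with a list comprehension.
feature_matrix = np.array([extract_features(clip) for clip in clips])

# Save once, then reload later instead of recomputing.
cache_path = os.path.join(tempfile.mkdtemp(), 'features.pkl')
with open(cache_path, 'wb') as fh:
    pickle.dump(feature_matrix, fh, protocol=pickle.HIGHEST_PROTOCOL)
with open(cache_path, 'rb') as fh:
    restored = pickle.load(fh)
```

For a plain numeric array, np.save and np.load would work just as well; pickle is shown because that is the approach the post describes.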
Building Restricted Boltzmann Machines and Deep Belief Nets
Unfortunately, I ran out of time and was unable to complete the conversion of the stacked RBM code. I did however complete the optimize function that could perform the model updates and I was able to verify that the RBM executed properly (although I couldn’t test its efficacy).
My deep learning model is a Neural Network constructed from a Deep Belief Net, which in turn is made of stacked Restricted Boltzmann machines. Restricted Boltzmann machines (RBMs) are like one layer of a neural network, but they are trained in a special way, Contrastive Divergence, that doesn’t require the data labels. This unsupervised learning algorithm seeks to improve the ability of the RBM to represent the data by training it to reconstruct the input data from the hidden layer of the network. For a better explanation of why this works, I recommend Hinton’s homepage, which is full of his papers and lectures.
Here’s the Python version of the Contrastive Divergence algorithm:
    import numpy as np

    def logistic(x):
        '''Computes the logistic'''
        return 1. / (1 + np.exp(-x))

    def sample_bernoulli(probabilities):
        '''Samples from a bernoulli distribution for each element of the matrix'''
        return np.greater(probabilities, np.random.rand(*np.shape(probabilities))).astype(float)

    def cd1(model, visible_data):
        '''Computes one iteration of contrastive divergence on the rbm model'''
        # assumes visible_data is (n_visible, N_cases), W is (n_hidden, n_visible)
        # and both biases are column vectors
        N_cases = np.shape(visible_data)[1]
        # forward propagation of the inputs
        vis_prob_0 = visible_data
        vis_states_0 = sample_bernoulli(vis_prob_0)
        hid_prob_0 = logistic(model['W'] @ vis_states_0 + model['fwd_bias'])
        hid_states_0 = sample_bernoulli(hid_prob_0)
        # reverse propagation to reconstruct the inputs
        vis_prob_n = logistic(model['W'].T @ hid_states_0 + model['rev_bias'])
        vis_states_n = sample_bernoulli(vis_prob_n)
        hid_prob_n = logistic(model['W'] @ vis_states_n + model['fwd_bias'])
        # compute how good the reconstruction was
        vh0 = hid_states_0 @ vis_states_0.T / N_cases
        vh1 = hid_prob_n @ vis_states_n.T / N_cases
        cd1_value = vh0 - vh1
        model_delta = {'W': cd1_value,
                       'fwd_bias': np.mean(hid_states_0 - hid_prob_n, axis=1, keepdims=True),
                       'rev_bias': np.mean(visible_data - vis_prob_n, axis=1, keepdims=True)}
        return model_delta
A Deep Belief Network (DBN) is a stacked up version of pre-trained RBM models which can then be treated as a Neural Network and fine tuned by the standard back propagation algorithm using the data labels. Because the model is pre-trained on the data, the back propagation step doesn’t have to change as much to get a good model and because the pre-training didn’t use the labels it is less likely to overfit.
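The stacking idea can be sketched as greedy layer-wise pre-training: train one RBM, push the data through it, and train the next RBM on the result. This is a simplified illustration, not the unfinished code from this project. The names train_rbm and pretrain_dbn, the layer sizes, the learning rate, and the toy data are all assumptions, and the CD-1 update here uses reconstruction probabilities directly instead of sampling the visible states. The back propagation fine-tuning step is not shown.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=20, learning_rate=0.1, seed=0):
    """Train one RBM with a simplified CD-1 update.

    data is (n_visible, n_cases) with values in [0, 1].
    """
    rng = np.random.default_rng(seed)
    n_visible, n_cases = data.shape
    W = 0.01 * rng.standard_normal((n_hidden, n_visible))
    fwd_bias = np.zeros((n_hidden, 1))
    rev_bias = np.zeros((n_visible, 1))
    for _ in range(n_epochs):
        hid_prob_0 = logistic(W @ data + fwd_bias)
        hid_states_0 = (hid_prob_0 > rng.random(hid_prob_0.shape)).astype(float)
        vis_prob_1 = logistic(W.T @ hid_states_0 + rev_bias)
        hid_prob_1 = logistic(W @ vis_prob_1 + fwd_bias)
        # Data-driven statistics minus reconstruction statistics, averaged over cases.
        W += learning_rate * (hid_states_0 @ data.T - hid_prob_1 @ vis_prob_1.T) / n_cases
        fwd_bias += learning_rate * np.mean(hid_states_0 - hid_prob_1, axis=1, keepdims=True)
        rev_bias += learning_rate * np.mean(data - vis_prob_1, axis=1, keepdims=True)
    return {'W': W, 'fwd_bias': fwd_bias, 'rev_bias': rev_bias}

def pretrain_dbn(data, layer_sizes):
    """Greedily train a stack of RBMs; each layer learns from the one below."""
    layers = []
    layer_input = data
    for n_hidden in layer_sizes:
        rbm = train_rbm(layer_input, n_hidden)
        layers.append(rbm)
        # Hidden probabilities become the "visible" data for the next RBM.
        layer_input = logistic(rbm['W'] @ layer_input + rbm['fwd_bias'])
    return layers

# Toy data: 20 features by 50 cases, no labels needed for pre-training.
toy_data = np.random.default_rng(1).random((20, 50))
dbn = pretrain_dbn(toy_data, layer_sizes=[12, 6])
```

The resulting list of weight matrices and forward biases is what would initialize the layers of a neural network before fine-tuning with the labels.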
K Nearest Neighbors
To achieve some closure in this project, I ran the feature set that I built through the K nearest neighbors algorithm in scikit-learn. The results weren’t great, but they were about the same as the Cornell Benchmark for the competition. I really like how easy it is to get all of the reporting tools in Python:
    Classification report for classifier KNeighborsClassifier(algorithm=auto, leaf_size=30, metric=minkowski, n_neighbors=5, p=2, weights=uniform):
                 precision    recall  f1-score   support

              0       0.84      0.90      0.87     11286
              1       0.61      0.48      0.54      3714

    avg / total       0.78      0.79      0.79     15000

    Confusion matrix:
    [[10119  1167]
     [ 1923  1791]]
    AUC: 0.689413478377
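A report like the one above can be produced with a handful of scikit-learn calls. This is a sketch on synthetic data from make_classification, not the whale features, and the split sizes and random seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the spectrogram features (not the whale data).
X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)
# Probability of class 1, so the AUC reflects ranking rather than hard labels.
scores = knn.predict_proba(X_test)[:, 1]

print(classification_report(y_test, predicted))
cm = confusion_matrix(y_test, predicted)
print('Confusion matrix:')
print(cm)
auc = roc_auc_score(y_test, scores)
print('AUC:', auc)
```

One observation, offered tentatively: the AUC of 0.689 in the report above equals the average of the two recall values (10119/11286 and 1791/3714), which is what roc_auc_score returns when it is given hard 0/1 predictions. Passing the predict_proba scores, as in this sketch, scores the ranking instead and may give a different number.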
Overall, I am very pleased with writing these algorithms in Python. I had to jump through many hoops to get the matrices to work right when I wrote it in R. For Python, the ability to use numpy algorithms over and over again in clear and simple ways was quite nice. I think I was slowed down more by reading my old R code than by writing the Python version, although testing each bit of code to make sure it was right did also take some time. I decided to end this project early because I knew I wouldn’t be able to write good Python code if I rushed it any more than I already had, and I plan on using Python for a while, so it was better to get it right. Rest assured I will follow up with the completion of the conversion to Python; after all, the Whale Detection Challenge inspired half of this blog’s name.
Final thoughts for 30 Days of Python
These 30 days have been a great experience. I did find the process quite exhausting at times and I wasn’t always sure I would get through it. I came out the other side, though, with a lot more knowledge of how to do useful things in Python. I’ve already started applying this knowledge at work. I hope to continue this learning process at a slower pace and also take the time to dive into some deeper projects that I thought of while doing my 30 days. I want to thank everyone who left comments, liked a post (here or on Google+), or even just read what I wrote. Knowing that people were paying attention really kept me to my schedule, and I learned a lot of useful information from people’s feedback.