I gave a talk at PyLadies Montreal last week about using Kaggle to become a better data analyst, and my slides and IPython notebook are available on my GitHub page. Here is a recap of why I like Kaggle, even though I only discovered it two weeks ago.
- It offers nice 101-level competitions. Machine learning beginners can safely try their hand at pandas and scikit-learn by solving compelling problems with clean, manageable datasets, complete with tutorials. There is a lot of potential for learning by trying stuff and breaking stuff here!
- Everyone can use these datasets (if they have a Kaggle account, of course). You know that dataset from work you can't share because it's confidential? Kaggle datasets don't have that problem, so playing around with them with my friends addresses my need for teamwork and sharing ideas.
- Learning new techniques. Either through the tutorials or the ideas floating around the internet, it's easy to test a new technique on one of these datasets. I picture myself easily using the Titanic dataset as my "Hello world" for new machine learning techniques.
- Fun weekend project. These are fun problems that can be tackled in a weekend (or over the course of a few weekends if you want to push the limits of the problem, which of course you do!). It's a nice change of pace from my usual work, and the insights I gain by doing it for fun actually get integrated into my 'real' research.
And here is a recap of nice things I've discovered while playing with the Titanic data:
- "Master" is a title given to boys (before they can be called "Mister"), so it helps to fill in the missing age data with a more meaningful value, such as the median age of masters (3.5 years old, instead of 28 years old, the median of all the passengers aboard the Titanic).
- Feature importance. The scikit-learn RandomForestClassifier has a neat attribute called "feature_importances_", which gives each feature a score (the higher the score, the more that feature contributes to the model's predictions; the scores sum to 1). This can be helpful when exploring which features to include in your model.
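To make the "Master" trick concrete, here is a minimal pandas sketch. It is not the code from my notebook: the rows are toy data invented for illustration, the `Title` and `AgeFilled` column names are my own, and I'm assuming names are written as "Surname, Title. Given names".

```python
import pandas as pd

# Toy rows invented for illustration; this is NOT the real Titanic data.
df = pd.DataFrame({
    "Name": [
        "Smith, Master. John",
        "Smith, Master. Paul",
        "Smith, Master. Luke",
        "Jones, Mr. Adam",
        "Jones, Mr. Bert",
        "Jones, Mr. Carl",
        "Brown, Miss. Ann",
        "Brown, Miss. Beth",
    ],
    "Age": [2.0, 5.0, None, 30.0, 40.0, None, 20.0, 24.0],
})

# Pull out the title: the text between the comma and the first period.
# (Assumes the "Surname, Title. Given names" layout.)
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Median age per title, aligned row by row with the original frame.
title_median = df.groupby("Title")["Age"].transform("median")

# Fill missing ages with the median of the passenger's own title.
df["AgeFilled"] = df["Age"].fillna(title_median)

print(df[["Name", "Title", "Age", "AgeFilled"]])
```

In this toy frame, the boy with a missing age gets the median of the other masters rather than the median of everyone, which is the whole point of the trick.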
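And here is a rough sketch of reading "feature_importances_" off a fitted forest. Again, this is an illustration under my own assumptions: the data is synthetic, and the "signal" and "noise" feature names are made up so that it is obvious which one should come out on top.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data invented for illustration: the label depends only on
# "signal", while "noise" is unrelated to it.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = (X["signal"] > 0).astype(int)

# Fit a forest; random_state is fixed only to make the run repeatable.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Pair each score with its column name and sort, most important first.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Wrapping the scores in a pandas Series keeps the feature names attached, which makes the ranking much easier to read than the bare array.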
So go ahead, download a dataset, try things, break things, have fun, and see you on the leaderboard!