The following is the data science technical stack I believe every Python programmer should have in their arsenal. I may be missing one or two frameworks, but this stack does the job for me. I’ve provided links to their respective main pages so you can learn more about them.
PyCharm is my favorite IDE for Python programming due to its array of features. Its built-in VCS integration is top notch if you want to push your work directly to remote repositories, and it has incredible support for third-party web frameworks. Utilities-wise, it handles debugging, refactoring and testing in a very intelligent fashion. Link
Scikit-learn is a machine learning-focused framework that utilizes a uniform API paradigm for its algorithms (fit/transform/predict). It will be your trusty sidekick for pretty much any machine learning experiment. Though not heavily focused on statistics, scikit-learn provides amazing, fast and robust preprocessing and metric evaluation tools alongside its algorithms. It is tightly coupled with the NumPy and SciPy frameworks, allowing it to be the ‘brains’ of your technical stack. It doesn’t provide as thorough and sophisticated an API for deep learning as Tensorflow/Keras, but its neural network API is still pretty darn good. Link
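To see what that uniform fit/transform/predict paradigm looks like in practice, here’s a minimal sketch (the classes are real scikit-learn ones; the choice of the bundled iris dataset and logistic regression is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Every preprocessing transformer exposes fit/transform...
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ...and every estimator exposes fit/predict, regardless of the algorithm.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_scaled, y)
preds = clf.predict(X_scaled)
```

Swap `LogisticRegression` for any other estimator and the surrounding code stays the same; that consistency is the whole point of the API.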
When you’re not off training your fancy model, you’ll likely be playing with numpy arrays. NumPy is a Python framework that provides an amazing multi-dimensional array/matrix API. You will use numpy arrays to store practically everything: your dataset, your labels, your very existence. Your first few lines of code will always be setting up your empty numpy matrices. It also has pretty good linear algebra functionality. Link
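A quick sketch of that workflow (the array shapes here are arbitrary, just to show the idea):

```python
import numpy as np

# Set up "empty" (zero-filled) containers for a hypothetical dataset:
# 100 samples with 5 features each, plus an integer label per sample.
features = np.zeros((100, 5))
labels = np.zeros(100, dtype=int)

# A taste of the linear algebra functionality: matrix multiplication.
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
product = a @ b  # same as np.matmul(a, b)
```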
Tensorflow & Theano w/ Keras
Tensorflow and Theano are frameworks optimized with GPU capabilities, making them extremely useful for deep learning tasks. Though they are a bit more complicated to use and involve constructing your neural networks with more code than scikit-learn, there is an amazing wrapper framework called Keras that makes life easier. Keras allows you to build your network with concise, easy-to-read code. You’re going to need these libraries if you want to play with convolutional neural networks (CNNs). Link to Keras
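Here’s a rough sketch of how concise a Keras network can be (the layer sizes and the 10-feature input are made-up numbers for illustration):

```python
from tensorflow import keras

# A small fully connected network for binary classification:
# 10 input features -> 32 hidden units -> 1 sigmoid output.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Three layers, a compile call, and you’re ready to call `model.fit` — compare that to wiring up the equivalent graph in raw Tensorflow or Theano.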
In the same niche as NumPy but providing different functionality is a flexible data structure API called pandas. Pandas lets you play with labeled datasets, often known as Series (1D) and DataFrames (2D). Its indexing capabilities are unmatched and allow for all kinds of dataset customization, and its compatibility with CSV and Excel files will make this framework a godsend for you. Say you want to drop a column ‘Housing_Prices’ after initializing and loading your DataFrame object df. All you do next is just type the command df.drop('Housing_Prices', axis=1), and voila! Link
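To make that drop example concrete, here’s a self-contained sketch (the column names and values are made up stand-ins for a real housing dataset, which you’d normally load with pd.read_csv):

```python
import pandas as pd

# A tiny DataFrame standing in for a loaded housing dataset.
df = pd.DataFrame({
    "Sq_Footage": [850, 1200, 1600],
    "Bedrooms": [2, 3, 4],
    "Housing_Prices": [150000, 210000, 320000],
})

# drop() returns a new DataFrame by default; the original df is untouched.
# Pass inplace=True if you want to mutate df directly.
trimmed = df.drop("Housing_Prices", axis=1)
```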
And now some honorable mentions…
- OpenCV – A cross-platform framework for computer vision tasks. It provides real-time video processing and has a built-in face recognition feature (via Haar cascades). It is now compatible with many deep learning frameworks including Tensorflow!
- scikit-image – sort of like scikit-learn’s little brother. Not as sophisticated as OpenCV, but its API has a similar feel to scikit-learn’s. I’ve used it at my workplace, where it did a pretty good job at blob detection.
- matplotlib – This framework should actually be up there at the top with the rest but whatever, it’s just a 2D visualization/plotting library (a really good one though). You will find yourself using it to visualize even your covariance/correlation/confusion matrices.
- Pillow – It’s a developed fork of the Python Imaging Library (PIL) and is a pretty good side tool for your computer vision tasks.
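As a small taste of the matplotlib use case mentioned above, here’s a sketch that renders a confusion matrix as a heatmap (the 3×3 matrix values are made up; the Agg backend just keeps it headless):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, no window
import matplotlib.pyplot as plt
import numpy as np

# A hypothetical 3-class confusion matrix (rows = true, cols = predicted).
cm = np.array([[48, 2, 0],
               [3, 44, 3],
               [1, 5, 44]])

fig, ax = plt.subplots()
im = ax.imshow(cm, cmap="Blues")
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
fig.colorbar(im)
fig.savefig("confusion_matrix.png")
```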
Maybe I should make a separate post focused on image processing frameworks, since I’ve managed to just clump them all in the honorable mentions section. Ah well. Leave a comment and let me know what you think! And make sure you’re pip install-ing all those frameworks afterwards :).