I decided to work on a lightweight project this weekend. I’ve noticed that job titles in the data science market have become quite blurred. It is not uncommon for a data scientist and an AI engineer to do the same type of work. I believe it all boils down to some form of data analysis and programming (with the exception of theoretical research).
I built a simple web scraper in Python using Selenium and Beautiful Soup, and ran each of the following search queries on Indeed.com, collecting 1000 job postings per query:
- “data scientist”
- “machine learning”
- “data engineer”
- “data analytics”
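Covering 1000 postings per query means walking through many result pages. A minimal sketch of generating those page URLs, not the scraper's actual code: the base URL path, the `q` and `start` parameter names, the page size of 10, and the `search_urls` helper are all assumptions made for illustration.

```python
from urllib.parse import urlencode

# The four search queries listed above.
QUERIES = ["data scientist", "machine learning", "data engineer", "data analytics"]

# Assumptions: the URL path, the "q"/"start" parameter names, and the page size
# are guesses about the site's search URLs, not details taken from the post.
BASE_URL = "https://www.indeed.com/jobs"
PAGE_SIZE = 10


def search_urls(query, total=1000, page_size=PAGE_SIZE):
    """Yield one results-page URL per page until `total` postings are covered."""
    for start in range(0, total, page_size):
        yield BASE_URL + "?" + urlencode({"q": query, "start": start})
```

Each URL would then be loaded with Selenium and the page source handed to Beautiful Soup; under these assumptions that is 100 page loads per query.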
I used Matplotlib to create these rather simple and bland bar graphs, but they did the job (I also suck at using graphical frameworks).
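Since I’m parsing raw HTML to text and simply doing a frequency count, some results may be slightly biased: a single-letter name like ‘R’ can match text that has nothing to do with the language. I didn’t include the programming language ‘Go’ for this reason, since it would also match the ordinary word ‘go’.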
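To make the bias concrete, here is a minimal sketch of this kind of count using only the standard library (so it is not the scraper's actual code, which used Beautiful Soup). The names `html_to_text` and `count_keyword` are made up for illustration, and the case-sensitive matching is an assumption.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text of an HTML document, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of `html` as one space-joined string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def count_keyword(text, keyword):
    """Count case-sensitive occurrences of `keyword` that stand on their own.

    The lookarounds reject a match that touches a letter, digit, underscore,
    '+' or '#', so "R" is not counted inside "Requirements" and "C" is not
    counted inside "C++" or "C#".
    """
    pattern = r"(?<![\w+#])" + re.escape(keyword) + r"(?![\w+#])"
    return len(re.findall(pattern, text))
```

For the text "Requirements: Python or R. Spark and R&D.", a plain `text.count("R")` finds three matches and `count_keyword(text, "R")` finds two, while only one refers to the language. Boundary matching reduces the bias but does not remove it.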
I performed the counts for each query on 4 different sets of words:
- Programming languages
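One way to organize these counts is a nested tally: one counter per search query and per word set. The sketch below assumes the postings have already been reduced to plain text. The words come from this post, but the grouping in `WORD_SETS` and the helpers `occurrences` and `tally` are placeholders, not the lists or code actually used.

```python
import re
from collections import Counter

# Placeholder word sets: illustrative only, not the full lists behind the charts.
WORD_SETS = {
    "languages": ["Python", "R", "SQL", "Java", "C++", "SAS", "Scala"],
    "frameworks": ["Hadoop", "Spark", "AWS", "TensorFlow"],
}


def occurrences(text, word):
    """Count case-sensitive occurrences of `word` not glued to other characters."""
    pattern = r"(?<![\w+#])" + re.escape(word) + r"(?![\w+#])"
    return len(re.findall(pattern, text))


def tally(postings_by_query, word_sets=WORD_SETS):
    """Return {query: {category: Counter(word -> total occurrences)}}."""
    results = {}
    for query, postings in postings_by_query.items():
        results[query] = {
            category: Counter(
                {word: sum(occurrences(p, word) for p in postings) for word in words}
            )
            for category, words in word_sets.items()
        }
    return results
```

With this layout, something like `tally(postings)["data engineer"]["frameworks"].most_common(3)` returns the three most frequent frameworks for one query, which maps directly onto one bar chart.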
Takeaway – Learn Python or R
It seems that the more statistically aligned the profession is, the more you’re going to want Python/R and Java/C++ on your technical stack. And the more data-oriented the profession is, the more you’re going to want SQL, R/Python, and SAS (and possibly Scala). I guess algorithm implementation is more common in ML engineering positions, hence the greater emphasis on Java/C++. Nonetheless, if you’re switching over to data science: learn Python and/or R.
Takeaway – Hadoop/AWS/Spark
Hadoop is the most popular framework across the 1000 job postings for each search query, followed by Spark and AWS. For machine learning, you can see TensorFlow sitting near the top as well (and I highly encourage ML enthusiasts to learn TensorFlow). Among deep learning frameworks, Torch and Caffe are less popular (I’ve heard Caffe is dying). Database-oriented engineers should definitely add Hive and Pig/HBase to their stack (I personally don’t know any of those, I’m just following the results).
Takeaway – The more ML, the more nerdy
At this point, I started checking non-technical qualifications such as academic background. The ‘data analytics’ and ‘data engineer’ roles shared similar results, as did the ‘machine learning’ and ‘data scientist’ roles. ‘Statistics’ and ‘Mathematics’ appeared about 2000 times across the ‘machine learning’ and ‘data scientist’ job postings and only a little over 1000 times for ‘data engineer’ and ‘data analytics’. A PhD is clearly highly preferred in direct machine learning and scientist roles, but nowadays a master’s is good enough in most cases. In fact, a bachelor’s degree plus tons of personal project experience in AI will get your foot in the door as well. It was also interesting to see top ML conferences (NIPS, ICML, etc.) show up in the results for the ‘machine learning’ query. If you’re mathematically or research inclined, you’ll be better suited for ML jobs.
Takeaway – Kaggle shouldn’t be your trump card
This category was essentially for all the words that I couldn’t fit into a relevant category, so I simply lumped them together in this misc. section, but there are some interesting results. Kafka came in 3rd place for data engineer roles but didn’t rank near the top for any of the other queries. ‘AI’ was hardly evident in data analytics and data engineering job postings. MapReduce appeared twice as often in data engineering postings as in any other role (which makes sense, I guess). Surprisingly, Kaggle hardly showed up in any posting for any query. I don’t know if this is because companies have no idea what Kaggle is or because participating in a predictive modeling competition isn’t too attractive to them.
Well, there you have it! I’ll summarize my findings below and attach a PDF containing all the bar charts so they’re easier to compare. It was a fun mini-project that I had always wanted to work on. I will upload the code to my GitHub.
- At the minimum you should know either Python or R (preferably Python).
- For the more “database-ish” roles, make sure you’re strong with SQL and big data frameworks such as Hadoop and AWS.
- If given a choice, I’d recommend that machine learning folks learn TensorFlow over the other DL frameworks.
- If you’re an undergraduate student and really passionate about ML, consider doing a PhD. If you’ve already graduated and want to switch over to data science, consider a master’s.
- Participate in Kaggle competitions if you’d like, but don’t let it be your main asset.
- If you’re not into math as much, consider a data engineering or analyst role.
- Unrelated to the data, but don’t let Python/R be your only go-to language. Industry applications will most likely require you to code your algorithm implementations in C/C++/Java.
Thanks for reading and leave a comment if you have any questions!