Analyzing Data Science Job Roles and Qualifications

In General, MLJunkie, by fossil11

I decided to work on a lightweight project this weekend. I’ve noticed that job titles in the data science market have become quite blurred. It is not uncommon for a data scientist and an AI engineer to do the same type of work. I believe it all boils down to some form of data analysis and programming (with the exception of theoretical research).

I built a simple web-scraper in Python using Selenium and Beautiful Soup, and queried the following words for 1000 job postings each on

  1. “data scientist”
  2. “machine learning”
  3. “data engineer”
  4. “data analytics”

I used Matplotlib to create these rather simple and bland bar graphs, but they did the job (I also suck at using graphical frameworks).

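Since I’m parsing raw HTML to text and simply doing a frequency count, some counts may be inflated. ‘R’ is the obvious example: a single letter can match stray text that has nothing to do with the language. I didn’t include the programming language ‘Go’ for the same reason, since it would also match the ordinary English word.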
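To make the bias concrete, here is a minimal sketch (not the code used for this post) comparing a naive substring count with a count that only accepts standalone tokens. The posting text and the function names `substring_count` and `token_count` are made up for illustration.

```python
import re

def substring_count(text, term):
    """Naive count: every occurrence of `term` as a raw substring, ignoring case."""
    return text.lower().count(term.lower())

def token_count(text, term):
    """Count `term` only when it stands alone, not inside a longer word."""
    # Lookarounds are used instead of \b so that terms ending in symbols,
    # such as "C++", are still matched as whole tokens.
    pattern = r"(?<![A-Za-z0-9+#])" + re.escape(term) + r"(?![A-Za-z0-9+#])"
    return len(re.findall(pattern, text))

# Hypothetical posting text, invented for this example.
posting = ("Requirements: strong R and Python skills. "
           "Go-getters with SQL experience are preferred.")

for term in ("R", "Go", "Python", "SQL"):
    print(term, substring_count(posting, term), token_count(posting, term))
```

The token-based count no longer picks up the letter inside words like “Requirements”, but it still counts “Go” in “Go-getters”, which is why an ambiguous term like that is probably best left out of a plain frequency count.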

I performed the counts for each query on 4 different sets of words:

  1. Programming languages
  2. Frameworks
  3. Academia
  4. Miscellaneous
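The bookkeeping could look roughly like the sketch below: for each query, tally how often every term in each word set occurs across that query’s postings. This is an assumption about the structure, not the actual code. The word lists are small subsets of terms mentioned in this post, the postings are invented, and `WORD_SETS`, `count_terms` and `tally` are hypothetical names.

```python
import re
from collections import Counter

# Illustrative subsets only; the real lists were longer.
WORD_SETS = {
    "Programming languages": ["Python", "R", "SQL", "Java", "C++", "Scala", "SAS"],
    "Frameworks": ["Hadoop", "Spark", "AWS", "TensorFlow", "Hive"],
    "Academia": ["PhD", "Masters", "Statistics", "Mathematics"],
    "Miscellaneous": ["Kafka", "MapReduce", "Kaggle", "AI"],
}

def count_terms(postings, terms):
    """Sum standalone, case-sensitive occurrences of each term over all postings."""
    counts = Counter()
    for text in postings:
        for term in terms:
            pattern = r"(?<![A-Za-z0-9+#])" + re.escape(term) + r"(?![A-Za-z0-9+#])"
            counts[term] += len(re.findall(pattern, text))
    return counts

def tally(postings_by_query):
    """Return {query: {word set name: Counter of term frequencies}}."""
    return {
        query: {name: count_terms(postings, terms) for name, terms in WORD_SETS.items()}
        for query, postings in postings_by_query.items()
    }

# Invented postings standing in for the scraped text.
sample = {
    "data scientist": [
        "PhD in Statistics preferred. Python and R required.",
        "Experience with Python, SQL and Spark.",
    ],
    "data engineer": ["Build pipelines with Hadoop, Spark, Kafka and SQL."],
}

results = tally(sample)
for query, by_set in results.items():
    for name, counts in by_set.items():
        print(query, "|", name, "|", counts.most_common(3))
```

Each inner `Counter` maps directly onto one bar chart: one chart per query and word set.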

Programming Languages

Takeaway – Learn Python or R

It seems that the more statistically aligned the profession is, the more you’ll want Python/R and Java/C++ in your technical stack. The more data-oriented the profession is, the more you’ll want SQL, R/Python, and SAS (and possibly Scala). I guess algorithm implementation is more common in ML engineering positions, hence the greater emphasis on Java/C++. Nonetheless, if you’re switching over to data science, learn Python and/or R.


Frameworks

Takeaway – Hadoop/AWS/Spark

Hadoop is the most popular framework across the 1000 job postings for each search query, followed by Spark and AWS. For machine learning, you can see TensorFlow sitting near the top as well (and I highly encourage ML enthusiasts to learn TensorFlow). Among the deep learning frameworks, Torch and Caffe are less popular (I’ve heard Caffe is dying). Database-oriented engineers should definitely add Hive and Pig/HBase to their stack (I personally don’t know any of those, I’m just following the results).


Academia

Takeaway – The more ML, the more nerdy

At this point, I started checking non-technical qualifications such as academia. The ‘data analytics’ and ‘data engineer’ roles shared similar results, as did the ‘machine learning’ and ‘data scientist’ roles. ‘Statistics’ and ‘Mathematics’ appeared about 2000 times across the ‘machine learning’ and ‘data scientist’ job postings, and only a little over 1000 times for ‘data engineer’ and ‘data analytics’. Clearly a PhD is highly preferred in direct machine learning and scientist roles, but nowadays a Master’s is good enough in most cases. In fact, a bachelor’s degree combined with tons of personal project experience in AI will get your foot in the door as well. It was also interesting to see top ML conferences (NIPS, ICML, etc.) show up in the results for the ‘machine learning’ query. If you’re mathematically or research inclined, you’ll be better suited for ML jobs.


Miscellaneous

Takeaway – Kaggle shouldn’t be your trump card

This category was essentially for all the words that I couldn’t fit into a relevant category, so I simply lumped them together in this misc. section. There are some interesting results, though. Kafka came in 3rd for data engineer roles but didn’t rank near the top for any of the other queries. ‘AI’ was hardly evident in data analytics and data engineering job postings. MapReduce appeared twice as often in data engineering postings as in any other role (which makes sense, I guess). Surprisingly, Kaggle hardly showed up in any posting for any query. I don’t know if this is because companies have no idea what Kaggle is or because participating in a predictive modeling competition isn’t too attractive to them.

Well, there you have it! I’ll summarize my findings below and attach a PDF containing all the bar charts so they’re easier to compare. It was a fun mini-project that I had always wanted to work on. I will upload the code to my GitHub.

  • At the minimum you should know either Python or R (preferably Python).
  • For the more “database-ish” roles, make sure you’re strong with SQL and big data frameworks such as Hadoop and AWS.
  • If given a choice, I’d recommend that machine learning folks learn TensorFlow over the other DL frameworks.
  • If you’re an undergraduate student and really passionate about ML, consider doing a PhD. If you have graduated and want to switch over to data science, consider a Master’s.
  • Participate in Kaggle competitions if you’d like, but don’t let it be your main asset.
  • If you’re not into math as much, consider a data engineering or analyst role.
  • Unrelated to the data, but don’t let Python/R be your only go-to language. Industry applications will most likely require you to code your algorithm implementations in C/C++/Java.

Thanks for reading and leave a comment if you have any questions!


  1. Interesting, I did a similar analysis on and it came up with almost twice as many data science jobs with Python in them as R:

    I need to update the plots with some D3 framework, but haven’t had the time yet. The idea (currently functional) is that you can shift+click the bar graph to select multiple skills, and subset the jobs based on that (a feature not even dice has, surprisingly).

    I’ve got about 12-13k entries for ‘data science’ in my db from daily scraping for a while.

    You don’t have location data, do you? I found most of the dice data science jobs are CA or NY. I’m wondering what explains the huge difference in R vs Python in my data vs your data. For one thing, the skills I scrape are from the little lego block section on dice job postings (the skills listing), so less potential for stray R’s to be counted.

    1. Author

      Maybe the brute-force parsing extracted the wrong ‘R’s 🙁 but I would still expect it to be in the top 5 at least. I checked out your link, it looks amazing and neat! I haven’t checked out dice, maybe that would have been better. Nonetheless, I guess it was a bit messy, but eh, it was done over a weekend.

  2. Would have been great to see suggested learning for the top items of your recommendations.

  3. Pingback: Which qualifications matter most for the 4 key data science roles? 2017-11-27 – Androidev
