
Know Your Strengths

A memory that has stuck with me throughout my career to date is of my data science hiring manager drawing a quadrant on the whiteboard during one of my interviews. The quadrant represented the spectrum of skills you need to become a data scientist, and every data scientist sits somewhere on this landscape. I was asked to draw a dot where I thought I sat across four primary DS skills: Technical/Engineering, Theoretical/Mathematical, Commercial Aptitude and Data Storytelling.

At the time, I greatly overestimated my technical ability – knowing what I know now, I would have rated myself lower on that scale. Today, however, I would say I sit offset towards technical/engineering.

I believe this kind of honest self-analysis is useful to a data scientist at any stage of their career, but especially in the early stages. Awareness of your skill set can act as a compass, pointing you towards the areas you need to improve and guiding you towards the work you are most likely to succeed at. Have a go at plotting yourself on the quadrant now, and then plot where you would like to be in, say, five years’ time.

The truth is that employers will most likely want you to sit somewhere across multiple skill groups. What use is a complex statistical analysis if you can’t extract the valuable findings and articulate them to others in the company, whether verbally or visually?

The same goes for brilliant technical data scientists who create amazing pieces of technology that hold no commercial value, or, vice versa, for the data scientist with a firm grasp of a project’s commercial value who has no idea where to start delivering it. The first years of your career will let you feel out the types of tasks you have the greatest aptitude for, though some of you will already know. And in the very rare exceptional case, you will have a high aptitude across all four axes of the quadrant.

The DS skills quadrant has four axes:

  • Technical/Engineering
  • Theoretical/Mathematical
  • Commercial Aptitude
  • Data Storytelling

One of the modern dilemmas facing aspiring data scientists is that the job description of a data scientist is ever expanding; the set of skills that could fall under the umbrella of “Data Scientist” is extremely large and growing by the day. It is therefore important to rank these skills by some criteria. However you choose to rank them, make sure you don’t become overwhelmed by the endless number of tools and skills you think you require. That said, there are some basic skills that are fundamental to the job.


Learn from Free Videos

Some concepts in data science need to be explained to you in a clear and human way. Short of having a tutor, mentor or teacher, the next best thing can be to watch a quality video explaining a concept to you. Luckily there are thousands of videos, of varying quality, covering most of the important fundamentals. For example, you could type “How decision trees work” into YouTube and start at the top; if one video doesn’t make sense to you, simply try another until you find someone who explains the concept in a way you understand. You get the added benefit of being able to pause the video and look things up as you go along, and you can rewind sections over and over, going back over anything you may have missed. I have found this a highly effective way to learn.

When I was learning, I created a table of important fundamental concepts alongside links to videos I found helpful in explaining them.

Many paid online course platforms, such as Udemy and Coursera, deliver lessons through video, but much of that content is already available for free on YouTube and other platforms. Do some research and find a channel and platform that works for you.


Work on Projects Interesting to You

If you ask a data scientist what projects they’ve worked on, you’ll get a long list of paid and personal projects that have all contributed to their learning. My first truly eye-opening and useful data science project was writing a web scraper in Python. The goal was to scrape consumer food product information, including ingredients, to determine which products contained palm oil, and I built a site on top of the data that let people search for products. The idea was that much of the rainforest destruction in South East Asia ties back to unsustainable palm oil production: large areas of land are cleared, most commonly by scorching perimeters of rainforest, to make space for palm oil plantations.

The project felt like a useful piece of work and kept me motivated enough to keep solving the challenges it threw up, from fixing encoding issues in raw HTML text to choosing the best way to store the data. It was a project I truly believed would add value, so I worked on it ferociously; I was willing to learn for the purpose of creating. You’ll also find your learning is more efficient, because you learn only what you need to get your project working. By documenting and recording these projects you’ll soon have a portfolio you can use to demonstrate your technical skills to employers.
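To give a flavour of the moving parts, here is a minimal sketch of that kind of scraper. The URL, the .ingredients selector and the alias list are placeholders for illustration, not the ones from my original project; any real site will need its own selectors.

# Minimal sketch of a product-ingredient scraper. The URL and CSS selector
# below are hypothetical placeholders; adapt them to the site you scrape.
import requests
from bs4 import BeautifulSoup

# A few names palm oil and its derivatives commonly appear under.
PALM_OIL_ALIASES = {"palm oil", "palm kernel oil", "palmitate", "palm fruit oil"}

def scrape_ingredients(url: str) -> list[str]:
    """Fetch a product page and return its ingredient list."""
    response = requests.get(url, timeout=10)
    # Guard against mis-declared encodings, a common source of garbled text.
    response.encoding = response.apparent_encoding
    soup = BeautifulSoup(response.text, "html.parser")
    # Placeholder selector: point this at the element holding the ingredients.
    ingredients_text = soup.select_one(".ingredients").get_text()
    return [item.strip().lower() for item in ingredients_text.split(",")]

def contains_palm_oil(ingredients: list[str]) -> bool:
    """Return True if any ingredient matches a known palm oil alias."""
    return any(alias in ingredient
               for ingredient in ingredients
               for alias in PALM_OIL_ALIASES)

if __name__ == "__main__":
    ingredients = scrape_ingredients("https://example.com/products/123")
    print("Contains palm oil:", contains_palm_oil(ingredients))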

While this was a relatively large project, the project you decide to start with can be as big or as small as you like, as long as you can see it has some incremental value, even if that value is very small. You’ll be amazed by the number of simple yet genuinely useful projects available to you. Here are some ideas:

  • A Python script to clean up unwanted system junk on your hard drive: automatically delete files in your trash, caches, logs and downloads folders. Whenever you’re running low on disk space, run the script.
  • A Python script that tells you the chances of losing a hand in poker. Great, as it brings in some statistics and probability theory.
  • A forecast of your costs or revenue over the coming months: most banks let you download a CSV file of your transactions and balance statements. You could also create some interesting visualisations of your predicted income and outgoings.
  • A random dinner recipe generator: scrape some recipes online, then write some code to randomly pick a recipe without replacement (see the sketch after this list). You’ll never run out of ideas for dinner.
  • A simulation of how you spend your time: record how long tasks take you, estimate how long they should take, then run scenarios of how you could spend your time over the coming weeks and months and pick whichever scenario is optimal (however you define that).
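To show just how small a starting project can be, here is a minimal sketch of the recipe picker. The hard-coded recipe list stands in for whatever you scrape, and the state-file name is made up for the example; persisting the remaining recipes between runs is what makes the draws “without replacement”.

# Minimal sketch of a random dinner picker that draws without replacement.
# The RECIPES list is a stand-in for data you might scrape yourself.
import json
import random
from pathlib import Path

STATE_FILE = Path("remaining_recipes.json")  # hypothetical state file
RECIPES = ["lentil curry", "mushroom risotto", "fish tacos", "ramen", "shakshuka"]

def pick_dinner() -> str:
    # Load the recipes not yet picked; start a fresh cycle if none remain.
    remaining = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else []
    if not remaining:
        remaining = RECIPES.copy()
    choice = random.choice(remaining)
    remaining.remove(choice)  # without replacement: never repeat within a cycle
    STATE_FILE.write_text(json.dumps(remaining))  # persist for the next run
    return choice

if __name__ == "__main__":
    print("Tonight's dinner:", pick_dinner())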

Some more advanced DS projects:

  • Sentiment analysis of tweets (lots of existing code and libraries do this).
  • Handwritten digit recognition with the MNIST dataset (see the sketch after this list).
  • An IMDB movie recommendation system.
  • Clustering user data for user segmentation.
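A first pass at the digit-recognition idea can be only a few lines with scikit-learn. Note that this sketch uses scikit-learn’s small built-in digits dataset as a stand-in for the full MNIST dataset, and a plain logistic regression as a baseline rather than anything state of the art.

# Minimal sketch of handwritten digit recognition, using scikit-learn's
# small built-in digits dataset as a stand-in for full MNIST.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = load_digits()  # 8x8 grayscale images, flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=2000)  # simple multiclass baseline
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))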

Luckily there are plenty of open-source datasets available for you to play around with, for example on kaggle.com. Just make sure it’s a project you are genuinely interested in and that you feel can provide some real-world value.


Get to Grips With The Terminology 

As with any technical industry, buzzwords, abbreviations and obscure synonyms for less obscure words are common. They are just part of the game, and learning what they all mean can seem daunting; in a way they act as a barrier to entry when they really shouldn’t.

Rather than write an exhaustive list, which would take far too long, here are a few important examples and their definitions:

  • ETL – Extract, Transform, Load: pulling data from source systems, reshaping it and loading it into a destination such as a data warehouse.
  • Heuristic(s) – A practical method for finding an approximate solution when exact, classical methods are infeasible or too slow.
  • Structured / Unstructured Data – Data that can easily be put into a table or database vs. other data such as PDFs or MP3 files.
  • Supervised / Unsupervised Algorithms – Machine learning algorithms that learn from labelled training data vs. those that learn without labels.
  • Greedy Algorithms – Algorithms that repeatedly take the locally best option in the hope of reaching a globally optimal solution.
  • Normalization – Adjusting values measured on different scales to a common scale (see the sketch after this list).
  • Residual (Error) – The deviation of an observed value from the value a model predicts for it.
  • Feature Engineering – The process of transforming raw data into the input features a machine learning algorithm trains on.
  • Application Containerization – Packaging code and its dependencies into a standard unit of software so the application runs consistently across computing environments.
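To make the Normalization entry concrete, here is a minimal sketch of min-max scaling, one common way of bringing values on different scales onto a common 0-to-1 scale; the income and age figures are made-up example data.

# Minimal sketch of min-max normalization: rescale each value to [0, 1].
def min_max_normalize(values: list[float]) -> list[float]:
    lo, hi = min(values), max(values)
    if hi == lo:  # avoid division by zero when all values are equal
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Example: incomes and ages end up on the same 0-1 scale.
print(min_max_normalize([30_000, 55_000, 120_000]))  # [0.0, 0.277..., 1.0]
print(min_max_normalize([22, 35, 61]))               # [0.0, 0.333..., 1.0]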

There are many more, of course, and the terminology is ever changing.

Just don’t take the number of terms you know as a proxy for your capabilities as a data scientist. You’ll learn what they mean in time.