Academia to Tech

My Journey


Introduction

When Rachael from Her+ Data Manchester got in contact asking if I would speak at one of their events this year, she gave me two options: an event loosely themed around data science, or one on the transition from academia to industry. I’ve given quite a few talks on data science topics, so I decided to share my experience of taking a tech job in industry after finishing my PhD. As well as drawing on my own experience, I have been involved in hiring and onboarding data scientists from different academic backgrounds, so I feel I can answer questions on this topic from the Her+ Data Manchester community.

Since this is something I haven’t spoken about before, I wasn’t really sure where to start. I decided to write all my ideas into this blog post first (which I’ll share after my talk) before picking out the highlights to present in a 15-minute presentation. Starting with a blank canvas, I’m trying to use Joseph Allen’s advice for writing a killer blog post. I’ve got my bullet points of topics to cover on a page and my text font is set to white (slightly bizarre seeing red squiggly lines where all my spelling is bad but no words; those sitting around me on the train must be thinking I’m only pretending to do some work).

Caption: how this blog started

My Journey

To set the scene I’ll start with my journey, from going to university to starting a job in industry. A lot of my thoughts and ideas in this blog and at my talk are based on my own experiences, so they might not be completely applicable to everyone, but hopefully parts of what I say will be useful nonetheless.

After school I didn’t really know what to do, but given I was Scottish and higher education was free it was just assumed that I would go to university. I went to Heriot-Watt University in Edinburgh to study physics, which I quickly realised I didn’t enjoy, so after the first semester I transferred into a maths and physics degree. As part of this I had a statistics module with a really enthusiastic lecturer - Jennie Hansen - whose love for the subject inspired me to transfer again, this time to a Maths and Statistics degree. My undergraduate was a 4 year course (as standard in Scotland) and in the summer between my third and fourth year I went to Lancaster University to do an internship at their STOR-i CDT (Statistics and Operational Research with Industry, Centre for Doctoral Training). This internship was geared up to give us an idea of what it would be like to do a PhD. Alongside 9 other interns I worked on an individual research project for the summer, in which I mainly spent my time learning and using R.

Caption: My intern cohort at STOR-i plus our supervisors

Having enjoyed my summer in Lancaster I decided to return after my undergraduate course to do the combined masters and PhD programme which the CDT offers, squeezing in a summer doing a statistics internship at Shell just before starting. My PhD looked at a statistical technique called changepoint detection.

The STOR-i CDT is set up in a way to give PhD students a good understanding of solving business problems and many PhD projects are sponsored by industrial companies. The CDT organised many events so I got the opportunity to listen to lots of presentations from people in both academia and industry. I think I knew fairly early on in my PhD that I would end up leaving academia afterwards to go into a job in industry as I enjoyed hearing more about the applied side of Statistics and OR as opposed to the theoretical developments.

During the last few months of my PhD I took on a part time job with a sports data science company which gave me a bit of experience of applying data science in the “real world” and once I finished my PhD I joined this company full time for a brief period before moving to Peak where I have been a data scientist for nearly 3 years.

Day to Day

Academia

During my PhD much of my time was spent independently researching current methods for changepoint detection and developing new methods. I would have weekly meetings with my supervisors, who would give me lots of ideas and help guide me through my PhD. I used to read a lot of publications and books to try and understand the current theory in lots of detail, and would spend time trying to theoretically prove our new methods and code up corresponding functions in R.

I was fortunate that I did my PhD at a CDT where there were 4 years of cohorts with on average 10 students, so I had the opportunity to chat to other PhD students regularly. We would have weekly presentations where a PhD student would present what they were working on, as well as regular book clubs and collaborative meetings to share ideas and provide a support network. These sessions gave me a more rounded exposure to the problems being solved in statistics and operational research than I think I would have got if I had been a lone PhD student in another department.

I published 2 papers during my PhD so spent a good chunk of time researching, writing and coding as well as replying to reviewer comments. I also attended a few conferences during my time to present my work to the wider community.

Industry

In industry my day to day is much faster paced. I attend a lot more meetings and I’m encouraged to collaborate much more often. I work on multiple projects at once and have much shorter deadlines to manage. There are fewer long, focused work periods than you get in academia - this is something I enjoy though, as my attention span in academia sometimes struggled with the long periods of just working on the same project.

At Peak we provide data science as a service to businesses so I get to go out and meet customers on site to get a better understanding of the business needs for data science.

We have a large data science team at Peak who are very smart and open and willing to share and learn from each other. We have regular academy and in depth sessions where a member of the team will present either a project they are working on or a technique they have found in the literature to the rest of the team.

The transition

I found the transition to industry really easy. By the end of my 3 years working on my PhD I was ready for a change. Having a part time job as a data scientist helped as it gave me an insight into working outside of academia whilst I wrote up my thesis. My first role at Sporting Data Science was home based and part of a really small company (there were two of us). Shortly after going full time with them I realised I missed the collaboration and support that I got from STOR-i so was really happy when I found Peak.

At Peak I got stuck straight into a project on forecasting. There was a good balance between researching methods to use and applying them to the business problem. The nature of this work suited me much more than trying to develop novel techniques and theoretically prove them.

Caption: working at Peak

Where are the changepoints?

I still use changepoint detection on a regular basis, partly because I find it a really useful tool, particularly when I’m doing forecasting or time-series projects, and partly because it is a bit of a comfort blanket that I feel really confident using. However, I normally just use an out-of-the-box method from the R changepoint package as opposed to one of the methods I developed during my PhD. Changepoint detection makes up only a small part of my data science toolbox, and over the years in industry I’ve broadened the techniques I use. However, it will take a lot longer to develop as deep an understanding of other methods as I have of changepoint detection. During a PhD you build up expertise in quite a niche area. It takes some getting used to not being an expert in a field in industry, particularly if you end up in a role where you work on a broad range of solutions.
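To give a flavour of what changepoint detection does, here is a minimal sketch in Python. It is a toy illustration only: it is not the R changepoint package mentioned above and not one of the PhD methods. It assumes there is exactly one change in the mean of the series and picks the split that minimises the total squared error around each segment's mean. The function names and the data are made up for illustration.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations of a segment from its own mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def single_changepoint(series, min_size=2):
    """Find the best single split of `series` into two segments.

    Returns (index, cost), where `index` is the position of the first
    observation in the second segment and `cost` is the combined
    squared error of the two segments.
    """
    if len(series) < 2 * min_size:
        raise ValueError("series too short for the requested min_size")
    best_index, best_cost = None, float("inf")
    # Try every admissible split point and keep the cheapest one.
    for k in range(min_size, len(series) - min_size + 1):
        cost = sse(series[:k]) + sse(series[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index, best_cost


# Hypothetical series: the level jumps from about 10 to about 15.
series = [10.1, 9.8, 10.3, 10.0, 9.9, 15.2, 14.8, 15.1, 15.0, 14.9]
index, cost = single_changepoint(series)
print(index)  # position where the second segment starts
```

This sketch always returns a split, even when nothing has changed. Practical methods compare the improvement in fit against a penalty before declaring a changepoint, and can search for several changes at once.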

KPI/ROI/B2B/B2C/HUH?

One of the challenges I faced when I first joined Peak was the business understanding. There are a lot of three-letter acronyms thrown about. For the first few weeks I was completely lost whenever anyone mentioned ROI, KPI etc. Realising there were business terms I needed to learn highlighted the fact that the goalposts are different in academia and industry. In academia you are trying to develop a novel method with the aim of publishing in a journal and/or presenting at conferences. On the whole these new methods are incremental advances on what has previously been developed.

In industry it’s more important to solve problems in a cost-effective way. This could mean using a simpler method than you might have done in academia, but if it’s quicker and cheaper and provides a good return on investment then it might be better than a more complicated method with limited gains.

A picture tells a 1000 words

During my PhD I had lots of opportunities to present my work. This is something I really value from the STOR-i CDT: they encouraged us to do lots of public speaking, both internally and externally. Although presenting to my PhD peers really helped grow my confidence in public speaking, in industry I find myself needing to explain my work a lot more to stakeholders who don’t have a technical background. This is something that I’ve had to work on over the past couple of years, and what I’ve found to work well is finding ways to visualise solutions using graphs or animations with no maths notation in sight.

Real life is messy

During my PhD it was difficult getting access to good data sets. I used a lot of open source data sets, many of which every other researcher also used, so they got overused and boring. In industry there is an abundance of data - however a lot of it is messy, unstructured, contains missing values etc. Before joining Peak I don’t think I anticipated quite how much time in a project would be spent cleaning and transforming data. A large proportion of time on any project is spent understanding the data and manipulating it into a form that is going to be useful.
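As a rough idea of what that cleaning work can look like, here is a small sketch in plain Python. The records, field names and cleaning rules are all hypothetical and chosen only for illustration: it tidies up store names, strips currency symbols and thousands separators from amounts, and drops rows whose amount is missing or cannot be parsed.

```python
def clean_sales(rows):
    """Normalise raw rows and drop those without a usable amount."""
    cleaned = []
    for row in rows:
        # Tidy the store name; treat None the same as an empty string.
        store = (row.get("store") or "").strip().title()
        # Remove the currency symbol and thousands separators.
        raw_amount = (row.get("amount") or "").replace("£", "").replace(",", "").strip()
        try:
            amount = float(raw_amount)
        except ValueError:
            continue  # missing or unparseable amount: skip the row
        if not store:
            store = "Unknown"  # keep the sale but flag the missing store
        cleaned.append({"store": store, "amount": amount})
    return cleaned


# Hypothetical messy input of the kind described above.
raw = [
    {"store": " manchester ", "amount": "1,200.50"},
    {"store": "LEEDS", "amount": ""},
    {"store": None, "amount": "£300"},
    {"store": "Leeds", "amount": "n/a"},
]
for record in clean_sales(raw):
    print(record)
```

Whether to drop a row, fill in a default or go back to the data owner is a judgement call that depends on the project; the choices in this sketch are just one option.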

The end user

When you do a PhD the end goal is to write and defend a thesis, with some mini goals along the way of possibly publishing your work in a journal. Afterwards someone else might pick up your research and incrementally improve your work (or your thesis will just sit on your bookshelf gathering dust forevermore). In industry you need to think about the end user. How will someone interact with your outputs? This could be through dashboards, sending data back to the customer for them to upload to their system, or deploying APIs. It’s important to think about the user early on in the project, as you don’t want to develop an overcomplicated model which takes a long time to run when they need to interact with the solution in real time, with results returned to them within seconds.

R or Python

The infamous R or Python data science debate is one for a much longer blog post (I won’t bother writing one as there are a lot of these around and I don’t think I have a strong argument either way). During my PhD I used R and collaborated on the changepoint R package. Given I’ve used R for over 8 years now, I feel like I have a decent understanding of coding best practices and efficient ways to run R code. Recently I’ve noticed more and more people using both languages: R for EDA, plotting and reports, and Python for machine learning. I think for anyone looking to get a job in tech after academia it’s essential to become great at at least one programming language. Once you know one language well, picking up another doesn’t take as long as learning to program from the very beginning.

Is a PhD Necessary?

Personally

For me I think a PhD was necessary from the point of view that the term “Data Science” was unknown to me up until about the penultimate year of my PhD. I think had I not gone to Lancaster I probably would have gone into tax or actuarial maths after my undergraduate studies which many of my peers did.

For me a PhD has helped develop my curiosity. I like to get a deeper understanding of how methods and techniques work. There aren’t many opportunities in industry to really dive deep into the inner workings of the methods used.

I really value the opportunities I had during my PhD to travel to conferences to meet and collaborate with other researchers. I also had a lot of time to really develop my coding through collaborating on R packages and going on programming courses.

I believe doing a PhD in a doctoral training centre played a really big part in me progressing to a team lead role early on in my career, as it gave me an opportunity to understand many different techniques that were being used. Even though I’m far from an expert in a lot of different data science tools, I’ve got a very broad awareness of the sort of problems and solutions that people are interested in and are developing.

In general

I don’t think a PhD is necessary for a career in Data Science. The longer you stay in academia the more skilled you become in a very narrow area. Someone with a masters in data science probably has a broader knowledge of different data science tools than someone who has spent years deep diving into a particular area - however there are benefits to having both, and I think teams with a diverse mix of educational backgrounds are a really good idea. If you are currently doing a PhD or are looking to start one then take the opportunity to get involved in other things away from just working on your thesis, such as presenting at conferences or collaborating with other academics. I think these extra skills will really help in the future.

Regardless of education level, a tip I have for getting a data science role is to gain an understanding of how data science techniques work and apply these techniques to some “real world” (or as close to real world as possible) examples. When interviewing data scientists I look for people who have tried to understand why they are using a technique and why the results they get are as they are. If a candidate comes from a purely academic background, particularly with a masters degree, I like to see them work on projects outwith their day to day academic projects, such as Kaggle projects. For me someone who can explain a couple of data science techniques in detail is better than someone who can list lots of different methods with no concrete examples of when they used them and how they work.

Some final thoughts and tips:

Be curious about how methods work, not just how to use them.
Try and practise working on examples outwith your day to day academic projects.
Build up a toolbox for cleaning and transforming data, and doing some exploratory data analysis.
Try and attend a broad selection of talks and meetups both from academia and industry.
Practise talking about your research to friends and family.
Get good at programming - choose a language (R or Python) and practise.
Research the type of role you want in industry.
- A role in industry that has a high focus on research might be an easier transition
- If this is your first role in industry it might be better to find a company where you won’t be the first data scientist
Understand that industry is different from academia. You might be an expert in an area in academia but there will be lots to learn in industry.
