“I can’t wait to be a student for a week!” Sophie and I said to each other as we signed up for the Institute for Analytics and Data Science (IADS) Summer School at the University of Essex in July.
You might be surprised that the words “summer school” were associated with so much excitement. When we think of school, we probably think of assignments, long lectures, and the stressful exam period. The IADS summer course, however, only included the best parts of being at university: being around lots of people who are interested in the same things as you, and learning something new every day. And so, my colleague Sophie and I embarked on a week-long course focused on a variety of data-related topics.
While Sophie took on the world of machine learning and Bayesian statistics, I explored some of the advanced uses of the reliable R and the principles of best practice in analytics. The intensive courses, led by experts in the field, gave us both a lot to reflect on, as well as providing us with a foundation in new skills that we are now eager to develop.
Sophie got to grips with Python throughout the course, building her skills through a series of practical and technical tasks. As any analyst will know, being exposed to a new programming language with which you have no experience can be quite the challenge, to say the least, so imagine having to pick it up within a week! But as intimidating as this seemed at first, Sophie built upon her machine learning knowledge throughout the course while also learning the basics of coding in Python. The class turned out to be a great introduction to the programming language for her, and she is now eager to keep learning how to use it. It’s no secret that Python is the go-to for machine learning algorithms, so I look forward to seeing what great ML models Sophie will create with her new Python skills!
In my case, while I began the course expecting the technical skills sessions to have the most impact on me, it was the last class on Best Practice in Analytics that left me reflecting on the way I work. I found two of the topics in the class to be particularly relevant: the importance of documenting your analytical project, and that of ensuring it is reproducible.
Since becoming an Analyst 10 months ago, I have been increasingly exposed to the very early stages of project development, and the importance of following a robust data project lifecycle. Something I have not always considered at these stages, however, is whether the project will need reproducing at some point in the future, and how I can make sure that can be done as easily as possible. I have now learned that, in the rush to gain valuable insight for our customers, it is very easy to forget that the work will need to be reproduced later down the line, when new data has been released or a re-evaluation of our model is needed. Generating it from scratch again will be a time-consuming process, and one that could perhaps have been avoided.
There are many things we can do to make sure that doesn’t happen (such as conducting your analysis in R, Python, or another programming language, where you can adapt old code to new data). As emphasised in the course, however, in most cases something as simple as documenting your work can be enough to prevent it. That way, even if you aren’t around when the project needs reproducing, any other analyst can refresh or build on your previous data product. So, asking yourself some questions throughout your analysis can make all the difference:
- Have I annotated my R script in a way that other analysts would be able to understand?
- Should I record the formulae/steps/data manipulation I conducted in Excel, especially when I generated multiple spreadsheets and files?
- Should I log my reasons for choosing a particular algorithm or variables for my model?
- Should I record any unsuccessful attempts, to ensure those approaches are avoided in the future?
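One lightweight way to act on these questions is to keep a running project log alongside the analysis. The sketch below is only an illustration, written in Python rather than R, and it is not something taken from the course: the `log_entry` and `read_log` helpers, the three entry categories, and the log file name are all made up for the example.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical categories, loosely mirroring the questions above.
KINDS = {"step", "decision", "failed_attempt"}


def log_entry(log_path, kind, note):
    """Append one timestamped note to a JSON Lines project log."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    entry = {
        "time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "kind": kind,
        "note": note,
    }
    # Append mode keeps earlier entries, so the log grows with the project.
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def read_log(log_path, kind=None):
    """Return all logged entries, optionally filtered by kind."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return [e for e in entries if kind is None or e["kind"] == kind]


if __name__ == "__main__":
    # Demo in a temporary folder; a real log would live next to the project.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "project_log.jsonl"
        log_entry(log, "step", "Removed rows with missing postcodes")
        log_entry(log, "decision", "Chose a simple model so it is easy to explain")
        log_entry(log, "failed_attempt", "First clustering attempt gave unstable groups")
        for entry in read_log(log, kind="failed_attempt"):
            print(entry["time"], entry["note"])
```

A plain text file or a notes tab in a spreadsheet would do the same job. The point is simply that each step, decision, and dead end is written down at the moment it happens, where the next analyst can find it.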
Documenting the entire analytical process, while maybe a bit tedious, can contribute to transparency and ensure your work can be quickly reproduced by yourself or others, which in turn allows the focus to be on improving your data product, rather than figuring out how it was created in the first place. They call it ‘best practice’ for a reason, and I am definitely planning on implementing it more in my work from now on.
So, if you would like to explore other pillars of best practice in data analytics, try out a new programming language, build on your R knowledge, or learn about many other processes/models/systems in the world of data science, Sophie and I couldn’t recommend the IADS summer school enough. Until next summer, however, I will leave you with my first attempt at an R animated map, which, you guessed it, I created at the summer school. Till then! 😊