“Repetition is the mother of learning, the father of action, which makes it the architect of accomplishment,” or so goes the line widely attributed to motivational speaker Zig Ziglar!
In analytics, certain tasks are repeated regularly: processing raw data, running specific analyses, updating reports. Admittedly, these processes consolidate learning, move us towards our end goal and so contribute to accomplishing insight, but they are just plain old boring! Thankfully, these tasks can be made easier (and automated) through RAP. And no, we’re not referring to my hero Vanilla Ice.
So stop, collaborate, and listen, as I talk you through Reproducible Analytical Pipelines (RAPs).
Processing large volumes of data so that ECC shares high-quality, relevant data securely takes time to do accurately. For small datasets, this can easily be done in Excel, but Excel has its limits. Ever noticed how it won’t load more than 1,048,576 rows into a worksheet? Or how it constantly tries to turn everything into a date for some reason?
For the ecda mental health project, reproducibility meant better data quality (less chance of mistakes in processing) and greater efficiency (since the script does all the work for you).
I recently had to process several drug and alcohol reports with 250+ columns and 10,000+ rows. I needed to select the specific columns we’d agreed to share for the mental health project and add a few calculated columns, and I needed to do that for each individual report, then combine them into one huge csv. Surely there’s a better use of my time?
As it happens, writing an R script to do all the work for me was a much better investment of my time!
In R, just a select() statement saved me looking through each Excel column and, one-by-one, deleting the ones I didn’t need.
Adding the calculated columns in Excel would have meant writing formulas (and probably making mistakes with cell references…). In R, that condensed to a single mutate() statement.
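As a rough sketch of what those two statements can look like: the column names and the tiny data frame below are made up for illustration (they are not from the real reports), and I’m assuming the dplyr package, which is where select() and mutate() come from.

```r
library(dplyr)

# Made-up stand-in for one report; the real ones had 250+ columns.
report <- data.frame(
  client_id     = c(1, 2, 3),
  referral_date = as.Date(c("2023-01-05", "2023-02-10", "2023-03-15")),
  exit_date     = as.Date(c("2023-02-04", "2023-02-24", "2023-06-13")),
  notes         = c("a", "b", "c"),    # not agreed for sharing
  internal_code = c("x1", "x2", "x3")  # not agreed for sharing
)

processed <- report %>%
  # Keep only the columns agreed for sharing (hypothetical names).
  select(client_id, referral_date, exit_date) %>%
  # Add a calculated column: whole days between referral and exit.
  mutate(
    days_in_service = as.numeric(
      difftime(exit_date, referral_date, units = "days")
    )
  )

print(processed)
```

The column list in select() doubles as a written record of exactly what was shared, which is something a trail of deleted Excel columns can’t give you.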
Once the script worked on one report, it was no effort to run it on all the reports – just stick it all in a loop! Writing the script took maybe 30 minutes in total and saved me several hours of incredibly tedious work. But that’s not all the script did. Just by having the script, we have:
• Enabled better collaboration
• Created efficiencies
• Built trust
Reproducibility is best practice in data science, and for good reason: it ensures results can be verified and trusted. To collaborate effectively, other people need to be able to repeat, build on and maintain your work. If results and processes can be accurately repeated, there are more opportunities to build on existing work.
It’s also faster to check one single script than it is to check each spreadsheet individually. If the script is correct, then the outputs will be correct. The script stops me from needing to do things by hand, so there’s less opportunity for me to make a mistake.
In data science, building trust is crucial, and that means demonstrating the skills and techniques that have been applied so that others can trust your results.
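To give a feel for how small that “one single script” can be, here is a minimal sketch of the whole pipeline: process each report in a loop, then combine the results into one csv. Everything specific in it is an assumption for illustration: the process_report() helper, the column names, and the two tiny reports it writes to a temporary folder so that it runs end to end. It again assumes dplyr.

```r
library(dplyr)

# Hypothetical helper: the select() and mutate() steps for one report.
process_report <- function(report) {
  report %>%
    select(client_id, score_start, score_end) %>%
    mutate(score_change = score_end - score_start)
}

# Write two made-up reports to a temporary folder (stand-ins for the real files).
report_dir <- tempfile("reports_")
dir.create(report_dir)
write.csv(
  data.frame(client_id = 1:2, score_start = c(20, 35),
             score_end = c(10, 30), notes = c("a", "b")),
  file.path(report_dir, "report_a.csv"), row.names = FALSE
)
write.csv(
  data.frame(client_id = 3:4, score_start = c(15, 40),
             score_end = c(15, 25), notes = c("c", "d")),
  file.path(report_dir, "report_b.csv"), row.names = FALSE
)

# Find every report in the folder and run the same steps on each one.
report_files <- list.files(report_dir, pattern = "\\.csv$", full.names = TRUE)
processed <- list()
for (path in report_files) {
  processed[[path]] <- process_report(read.csv(path))
}

# Stack the processed reports and save them as a single csv.
combined <- bind_rows(processed)
output_path <- tempfile("combined_", fileext = ".csv")
write.csv(combined, output_path, row.names = FALSE)
```

Anyone checking the work only has to read process_report() once, rather than open every spreadsheet, and adding another report is just a matter of dropping the file into the folder.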
The benefits of R are not limited to data processing: it can be used for analysis too, with the same advantages. That’s why I’ll be using R to do all the analysis in the mental health project.