Problems with the data
Probably the most common cause of delays is researchers entering their data in a way that seems logical to read, but is highly illogical to analyse. For example, a student might record observations of multiple individuals in a sampling period and, when entering the data into an Excel spreadsheet, place these observations in different columns. If the number of individuals observed per sampling period varies, then some rows will have entries in more columns than others. When the student then comes to fit a model to the data, they are stuck! The data don’t fit the model, which typically requires one observation per row.

Examples like this might seem trivial, but they can quickly explode into a veritable data monster. I once knew a student who had spent several years in the field, carefully writing observations into a notebook. The student then transcribed these observations into a spreadsheet (complete with transcription errors), resulting in tens of thousands of rows of ‘data’, mostly in the form of natural language. While the transcription took months, extracting anything from the data took many more, requiring considerable expertise in developing regular expressions to pull information out of the inconsistent text.

What are some simple solutions? My best suggestion is always to do some preliminary analyses as the first data are coming in (or being entered). Ideally, run a test on some pilot data (e.g. the first few observations); this will tell you exactly what format the data need to be in. If you can, enter your data as you go and test that they can be turned into useful information.
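To make the ‘one observation per row’ point concrete, here is a minimal sketch of reshaping a wide spreadsheet into long format, using Python and pandas (the column names and values are hypothetical, not from the study described above):

```python
import pandas as pd

# Hypothetical wide-format spreadsheet: one row per sampling period,
# one column per observed individual, with gaps where fewer were seen.
wide = pd.DataFrame({
    "period": [1, 2, 3],
    "obs_1": [4.2, 3.9, 5.1],
    "obs_2": [4.8, None, 5.0],
    "obs_3": [None, None, 4.7],
})

# Reshape to long format: one observation per row.
long = wide.melt(id_vars="period", var_name="individual",
                 value_name="measurement")

# Drop the empty cells created by unequal numbers of individuals.
long = long.dropna(subset=["measurement"]).reset_index(drop=True)
print(long)
```

Each row of the result holds a single measurement, which is the shape most model-fitting functions expect.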
Problems with the model

Once the data are in a decent form, then comes the time to test the hypothesis. A lot has been written about the issues with hypothesis testing that arise from re-fitting models to data (e.g. see this recent blog post). I think many people do have a clear a priori test in mind when they first analyse their data, but then get bogged down in endless rounds of re-analysis because the data don’t meet the model assumptions. Think overdispersion, zero-inflation, etc. These words are what nightmares are made of for many PhD students. Worse still are situations like my thesis chapter, where I spent years trying to figure out which model to even use!

The solution? Again, you can’t beat getting in there and running a pilot analysis on the first few data points. You’re not aiming to look at P values here, not really even at trends, but at how well you can model your data. Simply plotting the data can reveal a huge amount. For example, a plot will very quickly tell you if your data are censored, and whether you should consider changing some parameters of your data collection. At this point, you shouldn’t be afraid to throw out data: the loss of a week now might be the gain of months down the road, and probably of much cleaner results as well, if you can refine your study design. It’s no real secret why grant funding agencies love proposals that contain pilot data. Also, don’t underestimate the value of alternative hypothesis-testing tools, such as permutation tests, which make very few modelling assumptions and are often a more direct test of a hypothesis.
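As a concrete illustration of the permutation-test idea, here is a minimal sketch in Python (the two groups and their values are invented for the example): shuffle the group labels many times to build a null distribution of the test statistic, then ask how extreme the observed value is.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Invented measurements for two groups (e.g. control vs. treatment).
control = np.array([4.1, 3.8, 4.5, 4.0, 3.9, 4.3])
treatment = np.array([4.9, 5.2, 4.7, 5.0, 4.6, 5.3])

observed = treatment.mean() - control.mean()
pooled = np.concatenate([control, treatment])

# Build the null distribution by shuffling group membership.
n_perm = 10_000
null_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    null_diffs[i] = (shuffled[len(control):].mean()
                     - shuffled[:len(control)].mean())

# Two-sided P value: how often is a shuffled difference at least as extreme?
p_value = np.mean(np.abs(null_diffs) >= abs(observed))
print(f"observed difference = {observed:.2f}, permutation P = {p_value:.4f}")
```

The only assumption here is that observations are exchangeable under the null, which is why such tests sidestep most of the distributional worries above.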
What about experimental data?

Experimental data are often much, much easier to analyse (I can’t emphasise this enough). For example, later in my PhD I collected almost exactly the same data as in my dreaded multi-year nightmare chapter, but this time with an experimental treatment. Yes, by then I’d learnt some analytical skills from my previous work, but that alone doesn’t explain why it took me just two hours to analyse the new dataset. The simple and clear experimental treatment gave me a contrast I could test easily, while my experimental design controlled for basically everything else.

However, I often get asked for advice by students who are struggling with the analysis of their experiments. Why? The most common reason is that their ambition led them to include multiple other treatments, generating lots of confounded data and losing sight of the original question. I remember someone telling me once: I don’t want to spend six days collecting these experimental data and not be able to use them to test any follow-up questions! My answer is always to remain focussed on the original question. If you can’t answer that one properly, then there are no follow-up questions. Sometimes large experiments can provide the data necessary to test other questions. However, the experimental design should never be compromised, or made more complicated than it needs to be, to answer that one question. This will not only benefit your ability to turn data into results, but probably also increase the impact of your work. ‘Elegant’ experiments are what we should strive for: imagine being able to publish your results as just a bar plot (this can happen, even in top journals, if the experiment is simple enough).

Problems with the question

It is really quite surprising how often researchers collect data without having a clear question in mind. As I wrote above, collecting data to answer a single uncompromising question will hugely facilitate analysis and increase the impact of the research. Often researchers have a question, but it hasn’t been worked out very well (i.e. it is vague). This jeopardises data collection, and ultimately the clarity of the conclusions. Can you write down your question in a single sentence? It is often really difficult to be that specific, but there are things you can do to help get there. My personal trick is to write a short talk. Questions don’t come from a vacuum; they come as the result of some body of work. Develop that body of work by making some introduction slides, finding photos that illustrate the background to the work, and writing some example slides for the methods. Suddenly, the question can just pop up, and with it a whole lot of clearer thinking about the process of going from your idea to your results. Here you should also consider your null hypothesis: what do you expect the data to look like if the process you are testing for wasn’t present? Formalising this can generate a lot of clarity.
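One way to formalise that null hypothesis is to simulate data with the process of interest switched off and see what your planned comparison looks like. A minimal sketch in Python (sample size, means, and noise level are all invented assumptions):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Under the null, the process of interest is absent, so both groups
# are draws from the same distribution (values purely illustrative).
n = 20
null_a = rng.normal(loc=5.0, scale=1.0, size=n)
null_b = rng.normal(loc=5.0, scale=1.0, size=n)
print(f"one 'no effect' dataset: difference = {null_b.mean() - null_a.mean():.2f}")

# Repeat many times to see the spread of differences expected from
# noise alone: any real result should be judged against this baseline.
diffs = [rng.normal(5, 1, n).mean() - rng.normal(5, 1, n).mean()
         for _ in range(5000)]
print(f"95% of null differences lie within ±{np.quantile(np.abs(diffs), 0.95):.2f}")
```

If you can’t write down a simulation like this, that is often a sign the question itself is still too vague.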
Still stuck?

Sometimes the environment we’re in just isn’t very conducive to thinking critically enough about our research. It could be personal distractions, or simply that the expertise you need isn’t available. Don’t be afraid to ask for help outside of your immediate circles. I overcame the problems with my nightmare chapter not because I was clever, but because I asked for help. I visited a lab in Sweden that had the expertise I needed to solve my analytical problem. Since then, I’ve written a bunch of papers with the postdoc (at the time) who helped solve that problem, and he even now helps me solve my PhD students’ problems! Don’t be shy or afraid to ask professors if you can visit their lab: I have had the privilege of hosting many students in my own lab, and I believe the lab has gained as much from these visits as the visitors have.

1 Comment
Rutger Vos
8/10/2018 11:26:06 am
My advice would be to follow The 7 Habits of Highly Effective People (I know, a classic self-help book, but actually pretty spot on): "begin with the end in mind". You're collecting data because you want to demonstrate something, i.e. that it fits your hypothesis. You're going to publish this as a paper, and the whole demonstration of your point will in the end be "figure 2" or whatever. How is such a figure generated? Which analysis steps need to be performed to arrive at it? What shape do the data have to be in? Will the size of your data set give you enough power? Work backward from that so that you know what the data collection protocol needs to look like.