Blog: how to find data for comparative analyses
While I was a visiting researcher at NCEAS, I led a roundtable on how to extract data from published sources for comparative analyses.
Phylogenetic comparative methods
In biology, we frequently perform comparisons, across all levels we can
measure: from comparing DNA sequences through different cell types,
between individuals, to across populations and species. Here, I am
going to focus on comparisons of the latter kind: phylogenetic comparisons
which give us an understanding for the diversity of strategies that
exist in nature and allow us to test ideas about adaptation.
These comparisons start from observations of currently existing
populations (though, if available, observations on ancestral
populations can be integrate). Since it is generally not feasible to
collect all these observations as part of the project, comparative
approaches tend to rely on combining already existing observations. In
this blog, I am reflecting on some of my experiences of how to
structure such a search and where to find the relevant information.
Before starting the data collection:
You have your hypotheses and made your predictions - now
you want to collect as many observations as possible to have a powerful
sample size. But how can you make your search systematic so you do not
miss any information and how do you know when to stop - while at the
same time needing to be open to the possibility that the data might not
be presented consistently in the same way? For this, it helps to
|What do you want data for?
This will be your sample and make the search systematic. So you have to
decide on a list of entries you are going to look for before you start
|What do you want data on?
This are usually the observations you are interested in. This is where
it might be helpful to have a broad definition and collect information
that you might later combine or omit.
For example, I did a project on the evolution of monogamy in mammals. I
could have searched for all instances of monogamy and thereby creating
a list of species where this behaviour occurs. However, with this
approach I would not have known when to stop (have I found all species
with monogamy?) and the term monogamy has been used inconsistently in
the past to describe different phenomena. Therefore, this is what I did:
|What do you want data for?
For as many mammalian species as possible.
For mammals, there are lists
that include all currently recognised species (more than 5400 by now).
This provides a systematic list to work through: search for
observations on the first in alphabetical order, and you know you are
done when you get to the end of the list. It will also mean that there
is no bias in which observations are being considered and unlikely that
you will miss any.
|What do you want data on?
On the social system of species
Simply searching for "monogamy" would not have worked since (i) the
term has more than one meaning (so irrelevant information would have
come up) and (ii) other terms exist which describe what I was
interested in (e.g. pair formation). Accordingly, I collected
information on the spatiotemporal distribution of females and of males,
and used this to determine which species are socially monogamous and
which are not.
There are different places that might contain relevant
observations for your study. In general, the distinction is between
secondary and primary sources.
|Compilations in secondary literature
is what I usually start and end with. To start, they give an overview
of how much information might be available and can give you ideas about
the definitions you might want to use for your variables. They can also
serve as a final check to understand whether your observations fit with
what has previously been reported.
There are three types of seconday literature. There are different types of
information in these three kinds of publications: sometimes it is more
focused on the variable you are interested at other times more on
providing natural history information on various species. They usually
contain references to the primary literature, so they are also useful
to identify where you might get information on direct observations (see
often combine information on different populations and species. They
can show the main taxonomic groups that have been studied, details on
individual species, and they can indicate how traits are being described.
discuss terminology and provide pointers to the primary literature.
They also discuss open questions which can help to place ones' own
hypothesis into context.
Online initiatives are starting to provide large amounts of information on a diversity of topics. These range from general topic efforts such as wikipedia through more specific efforts, such as ebird, the global biodiversity initiative, or the animal diversity web.
4.) data repositories
Comparative studies tend to
include their datasets alongside their publications. A way to find
their datasets is by searching through repositories. The main ones for
biological research are Data One (their search function also covers other repositories), the Knowledge Network for Biodiversity, and DataDryad. There are also other more general repositories, either by institutions or by publishers.
|Original information in primary literature
is the reality check. While the secondary literature might contain
information on many species, checking the primary literature makes sure
that the information fits with the definition you are using and that it
is correct and it might contain additional information that would be
useful later (in general, write down as much as you can).
There are different types of
primary literature and accordingly it is best to combine different
approaches to find this literature:
1) through references and cross-references
The books and reviews from the secondary literature usually link to primary information. From
there, backward (going through the reference list) and forward
(checking articles that cite this publication) can lead to additional
2) through search engines
On the one hand, there are the
search engines that find peer-reviewed scientific articles - such as
Pubmed, Google Scholar, or Web of Knowledge. Most comparative analyses
rely on information from peer-reviewed articles as the information
comes in a more standarized form. As you go through these articles, you might want to
consider that even if they report on a hypothesis that is unrelated
to your field of study they might still contain relevant information, for
example in the method section describing the study system. On the other
hand, I also use regular internet search engines - such as Duckduckgo,
Ecosia, or Google. These can be helpful to identify a diversity of
other relevant primary information, such as PhD theses, guide books
based on primary observations, or reports from conservation or
governmental agencies. I include this kind of information if the author
seems to have been on location.
3) through the library
Search engines usually do not cover books. In addition, information published several decades ago has often not been digitized. I
was lucky to have access to a well-stocked zoology library, which
contained natural history reports based on travels from the early 20th
century, edited volumes on the biology of various groups of species,
and books based on workshops.
4) through contacting scientists
your literature search
you are likely to identify which researchers have conducted work
on a given system. In case information relevant for your study is not
included in any publication, it might be worth contacting these
scientists. For most of my studies, I only included information linked
to a reference. For the data on the social system of mammals, I still
did contact other scientists: after I had an
initial classification for a large sample of species, I checked these
who had been working on various species. Conferences were a great place
for this: I managed to chat with many different researchers at a single
event, meeting during coffee breaks.
While a systematic search can be tedious, it can also be rewarding
additionally to just finding the necessary data. For one, scientists
usually complain that they never have enough time to catch up on the
relevant literature, and this gives you an opportunity to do so. It can
also give you a different appreciation of the diversity of strategies
that exist in nature. And while reading the diversity of publications
you can also come across gems - for example, I remember some early
natural history report from a French explorers of the various animals
they encountered in a South American country which included guidelines
on how to best prepare the animal for consumption (e.g. better fried
than cooked) or a report on an island rodent where the author suspected
that the animal did not have any natural predators because it would
always fall over when trying to run away.
- The Ecological Data Wiki has information on gathering information and using it.
- Manuela Gonzales has put together a great entry on sources of species-level data for animals.
- Antica Culica and colleagues wrote an article on the open data landscape in ecology and evolution.
- Researchers at NCEAS wrote a primer on including meta-data so that others (including future you) better understand the data and are able to find and reuse it.
- The PRISMA initiative have guidelines for systematic searches and reporting for meta-analyses