Phylogeny vs. popularity in the birds of New Zealand

Posted by Amelia E. Barber on Sunday, November 24, 2019

What is the relationship between bird phylogeny and popularity?


The Tidy Tuesday dataset this week is on New Zealand’s “Bird of the Year 2019” so I thought it might be fun to look at how the phylogeny of the birds of NZ correlates with their popularity. Are the most popular birds closely related to each other? Or are popular birds scattered throughout the phylogenetic tree of birds in the competition?

If you are a data scientist not familiar with evolutionary biology, a phylogeny is essentially a way of classifying organisms and their relationship to each other based on their evolutionary history - often measured by DNA similarity.

The packages we’ll be using this week, in addition to the core tidyverse packages, are wikitaxa to help convert common names to scientific names, treeio to read the phylogenetic tree into R, ggtree to visualize the phylogeny, and patchwork to combine the tree with a bar chart of the vote data, so let’s load these packages and get started. Ironically, we’re also going to use the purrr package a lot in our bird data exploration.


We can also grab the data from the Tidy Tuesday repo:

nz_bird <- readr::read_csv("")

The “Bird of the Year” voting was done in a rather complicated instant runoff system over several weeks, so if we look at the raw data, we can see that we have dates, hours, and vote ranks because people could vote for up to 5 birds. To simplify this, we’ll just look at the total number of votes each bird received.

## # A tibble: 217,300 x 4
##    date        hour vote_rank bird_breed         
##    <date>     <dbl> <chr>     <chr>              
##  1 2019-10-28     8 vote_1    Gibson's Albatross 
##  2 2019-10-28     8 vote_2    Tūī                
##  3 2019-10-28     8 vote_3    Kākā               
##  4 2019-10-28     8 vote_4    Kākāpō             
##  5 2019-10-28     8 vote_5    Little Spotted Kiwi
##  6 2019-10-28     8 vote_1    Spotted Shag       
##  7 2019-10-28     8 vote_2    Fantail            
##  8 2019-10-28     8 vote_3    Weka               
##  9 2019-10-28     8 vote_4    <NA>               
## 10 2019-10-28     8 vote_5    <NA>               
## # … with 217,290 more rows

Extract the common names of the top 30 birds

Overall, there were 85 different birds in the running for “Bird of the Year”.

nz_bird %>%
    filter(! %>% 
    distinct(bird_breed) %>% 
## [1] 85

For the sake of overcrowding in our phylogenetic tree, we’re going to focus on top 30 in terms of total votes

top_birds <- nz_bird %>%
    filter(! & ! %>%
    count(bird_breed) %>%
    arrange(desc(n)) %>%
    top_n(30) %>% 

##  [1] "Yellow-eyed penguin"       "Kākāpō"                   
##  [3] "Black Robin"               "Kākā"                     
##  [5] "Banded Dotterel"           "Kea"                      
##  [7] "Tūī"                       "Blue Duck"                
##  [9] "Fantail"                   "New Zealand Falcon"       
## [11] "Kererū"                    "Morepork"                 
## [13] "Antipodean Albatross"      "Little Penguin"           
## [15] "Orange-fronted Parakeet"   "Rockhopper Penguin"       
## [17] "Little Spotted Kiwi"       "Kingfisher"               
## [19] "Southern Brown Kiwi"       "Fiordland Crested Penguin"
## [21] "South Island Kōkako"       "Mōhua"                    
## [23] "Kōkako"                    "New Zealand Dotterel"     
## [25] "Rifleman"                  "Stitchbird"               
## [27] "Takahē"                    "Bittern"                  
## [29] "Fairy Tern"                "Saddleback"

Unfortunately, the dataset only provides us with the common names of the birds, while any sort of taxonomic database is going to require (or be more accurate) if we can provide the scientific names.

Using wikitaxa to obtain scientific names from a bird’s common name

I initially tried to get the scientific names for the birds from their common name using the comm2sci() function from the taxize package, but the results were pretty incomplete and inaccurate, so I ended up using the wikitaxa package which provides an easy way to access the WikiSpecies API.

To search Wikispecies, I took a vector of the top birds, and used purrr’s map() function to search the Wikispecies database for each bird name.

bird_list <- top_birds %>% 
    set_names() %>% 

Pluck hits from Wikispecies results list

If we look at a sample result from our Wikispecies search results, we can see the resulting list has a lot of stuff we don’t really care about.

## $batchcomplete
## [1] ""
## $query
## $query$searchinfo
## $query$searchinfo$totalhits
## [1] 4
## $query$searchinfo$suggestion
## [1] "yellow eyrei pengi"
## $query$searchinfo$suggestionsnippet
## [1] "yellow <em>eyrei pengi</em>"
## $query$search
## # A tibble: 4 x 7
##      ns title       pageid   size wordcount snippet                  timestamp  
##   <int> <chr>        <int>  <int>     <int> <chr>                    <chr>      
## 1     0 Megadyptes… 3.21e4    800        74 " (Endangered) čeština:… 2019-09-18…
## 2     0 List of vi… 9.65e5 156654     14868 "<span class=\"searchma… 2018-02-15…
## 3     0 List of vi… 1.10e6 173807     16483 "<span class=\"searchma… 2018-05-23…
## 4     0 List of vi… 1.07e6 704681        61 "Begomovirus ssDNA(+/-)… 2019-12-07…

To grab the information we are most interested in (the search results themselves), we can use purrr’s aptly named pluck() function to grab just the table with our potential species names.

query_results <- map(bird_list, ~pluck(.x, "query", "search"))

As an example, here is what the database came up with for the second item in our list, the Kākāpō:

query_results[[2]] %>% 
## # A tibble: 3 x 1
##   title                  
##   <chr>                  
## 1 Strigops habroptila    
## 2 Strigops               
## 3 Ngaheremyia fuscipennis

Condense multiple Wikispecies hits down to a single (correct) scientific name

If we looked at the results for the Kākāpō above where the first search result is the correct scientific name, we might assume that it would work to just take the first result for all our searches (we can also use purrr::map_dfr() to return a table of the results instead of a list).

top_result <- map_dfr(query_results, ~head(.x, 1), .id = "common_name") %>% 
    select(common_name, latin_name = title)
## # A tibble: 27 x 2
##    common_name         latin_name                            
##    <chr>               <chr>                                 
##  1 Yellow-eyed penguin Megadyptes antipodes                  
##  2 Kākāpō              Strigops habroptila                   
##  3 Black Robin         Black robin associated gemykibivirus 1
##  4 Kākā                Nestor meridionalis                   
##  5 Kea                 Steven Chew Kea Foo                   
##  6 Tūī                 Domene tui                            
##  7 Blue Duck           Hymenolaimus malacorhynchos           
##  8 Fantail             Rhipidura maculipectus                
##  9 New Zealand Falcon  Falco novaeseelandiae                 
## 10 Kererū              Hemiphaga novaeseelandiae             
## # … with 17 more rows

However, if we take a closer look at the result, we can see that our strategy didn’t work perfectly. For example, for the black robin, it picked a virus that infects black robins instead of the actual bird itself. And for the kea, it chose Steven Chew Kea Foo, the ecologist for which (I assume) the bird was named, instead of the bird itself.

If this was a dataset with 1000 birds, I would have had to get more creative on how to fix the problem. But because there were only a few mistakes to fix, I just manually did it by combining the results for all birds into a single dataframe, excluding the hits that I knew were wrong, and then using dplyr::distinct() to condense them down to one hit per species. There were also a couple birds that wikitaxa couldn’t assign down to the species level from the common name, such as the Rifleman, so I googled them and replaced the genus name with a full species name (eg. Acanthisitta to Acanthisitta chloris).

Unfortunately, the Wikispecies search wasn’t able to find matches for 3 of our 30 birds. To figure out which birds were missing results and save them for manual entry later, I used purrr::keep().

missing_birds <- keep(query_results, is.null) %>% 
## [1] "Banded Dotterel"           "Southern Brown Kiwi"      
## [3] "Fiordland Crested Penguin"

Again, if this was part of a bigger project, I would have tried to webscrap their Latin names somehow, but because it was only 3 entries, I just googled them and made a separate tibble of the missing birds, which I combined with the Wikispecies list above.

missing_bird_names <- tibble(
    common_name = missing_birds,
    latin_name = c("Charadrius bicinctus", "Apteryx australis", "Eudyptes pachyrhynchus")

Now at the end of it, we have a table with the common and scientific names of the top 30 birds in the New Zealand Bird of the Year voting, ready for phylogenetic analysis

bird_names_complete <- bind_rows(bird_names, missing_bird_names)
## # A tibble: 30 x 2
##    common_name         latin_name                   
##    <chr>               <chr>                        
##  1 Yellow-eyed penguin Megadyptes antipodes         
##  2 Kākāpō              Strigops habroptila          
##  3 Black Robin         Petroica traversi            
##  4 Kākā                Nestor meridionalis          
##  5 Kea                 Nestor notabilis             
##  6 Tūī                 Prosthemadera novaeseelandiae
##  7 Blue Duck           Hymenolaimus malacorhynchos  
##  8 Fantail             Rhipidura maculipectus       
##  9 New Zealand Falcon  Falco novaeseelandiae        
## 10 Kererū              Hemiphaga novaeseelandiae    
## # … with 20 more rows

There are a lot of way to assign phylogenetic relationships, but because I am not a bird taxonomist and the overall goal was to look at bird popularity through the lens of species relationship, not create the most accurate NZ bird phylogeny, I decided to the use the NCBI Common Tree to generate a phylogeny, rather than building my own from available sequence data.

To do this, I exported my list of newly-assembled Latin names to a text file, and uploaded that into the NCBI Common Tree web portal

plot_set <- bird_names_complete %>% 

write_lines(plot_set, "sci_names.txt")

Resulting in this lovely phylogeny:


The fantail (Rhipidura maculipectus) was not found in the NCBI taxonomy database, but it was able to place all the other birds into a phylogeny, which I then exported as a PHYLIP file to bring back into R.

Using treeio to import and manipulate tree data

To bring the NCBI-generated phylogeny back into R, we can use treeio::read.tree(), which nicely allows us to convert the Newick text file into a tibble for tidyverse-style manipulation.

raw_tree <- treeio::read.tree(here::here("content/post/phyliptree_fix.phy")) %>% 
    as_tibble() %>% 
    mutate(label = str_replace_all(label, "_", " ")) %>% 
    left_join(bird_names_complete, by = c("label" = "latin_name")) %>%
    mutate(label = if_else(!, glue::glue("{common_name} ({label})"),

In this case, we’re adding back the common names which got discarded when the list went to NCBI and using glue() to create nice tree labels that list both the common name and the scientific name.

We can then convert this back to a phylo object and visualize a cladogram of our popular New Zealand birds using ggtree.

tree <- treeio::as.phylo(raw_tree)

ggtree(tree, branch.length = "none") +
    geom_nodelab(geom = "label", size = 3) +
    geom_tiplab() +
    coord_cartesian(clip = 'off') +
    theme(plot.margin=margin(6, 250, 6, 6))

Examining taxonomic relationship vs. bird popularity

So how does phylogenetic relationship correlate with bird popularity?

Let’s first make a graph of how many votes each of the birds in our phylogeny got by going back to our original Tidy Tuesday NZ birds dataset, counting the number of votes for each bird species, and adding their Latin name to make the same “Common name (Latin name)” label we used in our tree.

vote_df <- nz_bird %>%
    filter(! & ! %>%
    count(bird_breed) %>%
    arrange(desc(n)) %>%
    top_n(30) %>% 
    left_join(bird_names_complete, by = c("bird_breed" = "common_name")) %>% 
    mutate(label = glue::glue("{bird_breed} ({latin_name})")) %>% 
    select(label, n)

## # A tibble: 30 x 2
##    label                                          n
##    <glue>                                     <int>
##  1 Yellow-eyed penguin (Megadyptes antipodes) 12022
##  2 Kākāpō (Strigops habroptila)               11161
##  3 Black Robin (Petroica traversi)             8304
##  4 Kākā (Nestor meridionalis)                  7655
##  5 Banded Dotterel (Charadrius bicinctus)      6692
##  6 Kea (Nestor notabilis)                      6364
##  7 Tūī (Prosthemadera novaeseelandiae)         6288
##  8 Blue Duck (Hymenolaimus malacorhynchos)     6194
##  9 Fantail (Rhipidura maculipectus)            6120
## 10 New Zealand Falcon (Falco novaeseelandiae)  5992
## # … with 20 more rows

Because ggtree objects play very nicely with the tidyverse, we can use dplyr verbs to take our tree, filter the tips where our species names are, use select() to get the y coordinate of each tip, and then join that with our calculated vote dataframe to create a bar chart with y positions that match our tree. We’ll also highlight the 3 bird species who got the most votes in the New Zealand Bird of the Year competition in our phylogeny and in our plot to really distinguish them.

(the plot doesn’t look like much on its own)

p2 <- p1 %>% 
    filter(isTip) %>% 
    select(label, y) %>%
    left_join(vote_df) %>%
    mutate(highlight = if_else(n > 8000, "yes", "no")) %>% 
    ggplot(aes(y, n, fill = highlight)) + 
    geom_col(show.legend = FALSE) + 
    scale_fill_manual(values = colors) +
    coord_flip() +
    annotate("text", x = 0.5, y = 12500, label = "Total # Votes", hjust = 1) +
    xlim(0.5, 29.5) +

But using patchwork, we can combine our tree with our bar chart into one seamless plot:

p1 + p2 + plot_annotation(
    title = "Are closely related New Zealand birds equally popular?",
    subtitle = glue::glue("Among the 30 most popular birds in the 'NZ Bird of the Year competition'
                          The three bird species receiving the most votes are highlighted in teal."),
    caption = "Data: NZ Forest and Bird Organization\n Visualization @frau_dr_barber",
    theme = cowplot::theme_map()

Not really. Looking at the plot, there doesn’t seem to be a clear relationship between where a species is in the phylogeny and the number of votes they received. The three birds receiving the most votes belong to different families. The overall winner, the yellow-eyed penguin is in the Spheniscidae family, the second place kākāpō (of R4DS cover fame) is in the Psittacidae family, while the third place black robin is a passeriformes.

My guess is that bird popularity is mostly a factor of bird cuteness, which is an evolutionarily convergent trait, rather than arising from a particilar branch of the tree.

Update: Dec 10, 2019

Because I initially chose to focus on the 30 most popular birds, there were a lot of species on thir own branches that are not truly “orphan” species, so I opted to go back and remake the phylogeny with all 80 birds instead of just focusing on the most popular birds.

full phylogeny

Though a bit large and unwieldy, I think this approach provides a more complete picture of the relationship between popularity and phylogeny among the birds of New Zealand. The overall conclusions remain unchanged - there are myriad examples of nearest neighbor species having very different levels of popularity, and the top birds all belong to different families. However, with the full tree it is also evident that there is some association in the population among orders. For example, the 11 species in the Pelecaniformes, which includes pelicans, are all uniformly unpopular, while all four birds in the Psittacidae order all have above average popularity.

End note: the critically-endangered kākāpō is currently battling an aspergillosis outbreak, so if you have enjoyed this post, please consider donating to their rescue effort.