Home > Maps > Indo-European cognates

General comments

The data comes from the Indo-European Lexical Cognacy Database (IELex). I've made no changes to the cognacy judgements listed there, though they occasionally surprise me. For example, the Persian for bad is bad, but the IELex page says that it was "judged unrelated with Engl." Another example: the Urdu for breast is پِستان, and the Persian is pistān. These are identical or nearly so, and Urdu has considerable Persian influence, but they are nevertheless assigned to different cognate classes, and I won't overrule the historical linguists who have spent a lot of time tracing sound changes through history.

What I did do was throw out a lot of the lines in the tables. For example, English has the words wide and broad; I threw out broad, so that only the cognate class for wide appears in that map over England. Most of the time, when encountering multiple words with the same (or similar) meaning in a language, I kept the first one listed and discarded the rest. But I preferentially discarded loanwords, words with multiple or irregular cognate classes, and words written in an unusual style (block capitals when most words for that language were in lower case, say). The important point is that this process was not methodologically sound for academic purposes, and my maps should be treated in that light.
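As a toy illustration of the keep-the-first-listed rule (the data frame and its column names here are my stand-ins, not the exact IELex layout):

```r
# Hypothetical excerpt: English has two entries for the meaning "wide".
toy.df = data.frame(language  = c("English", "English", "French"),
                    word      = c("wide", "wide", "wide"),
                    lang_word = c("wide", "broad", "large"),
                    stringsAsFactors = FALSE)

# Keep only the first entry listed for each (language, meaning) pair:
toy.df = toy.df[!duplicated(toy.df[ , c("language", "word")]), ]
# toy.df$lang_word is now c("wide", "large")
```

The preferential discarding of loanwords and oddly-formatted entries was done by eye rather than by a mechanical rule like this.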

Sometimes a word is assigned to more than one cognate class. This is common for because in some Romance languages – French parce que, for example, has one cognate class for parce and another for que. I omitted such words from the map, but included them in the tables.
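In the main script below, such words are dropped before mapping with a simple filter on the length of the class label, since a compound assignment is longer than a single class code. A toy version (these class labels are made up):

```r
cognate_class = c("A", "B1", "A,B", "F")
single_class = nchar(cognate_class) < 3   # compound labels like "A,B" fail this
cognate_class[single_class]               # "A" "B1" "F"
```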

In cases where two or more words were separated by commas or slashes (i.e., multiple words in the same cognate class), I usually kept only the first one. Often this presumably means keeping the masculine form of the word; I suspect for Albanian I've often kept the feminine form.
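That splitting can be sketched with a regex that keeps everything before the first comma or slash (toy forms, chosen to look like masculine/feminine pairs):

```r
raw_words = c("novo, nova", "i bardhë/e bardha", "wide")
first_form = trimws(sub("[,/].*$", "", raw_words))
# first_form is now c("novo", "i bardhë", "wide")
```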

The IELex database takes data from several earlier works, and these do not have uniform formatting. Some words are written in block capitals, some in lower case, some with diacritics and some without; most words in Magadhi are written in Devanagari script but not all, the Hindi words are written with the Latin alphabet, and so on. I made all the non-German words lower-case for uniformity, but didn't change any alphabets. I can't read Devanagari script at all, and I hope I didn't introduce any glaring errors while tidying up commas and slashes and so on.
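The lower-casing step, sketched on a hypothetical table (German is excluded so its capitalised nouns survive; tolower is a no-op on scripts like Devanagari, so those words pass through unchanged):

```r
toy.df = data.frame(language  = c("German", "Dutch", "French"),
                    lang_word = c("Hund", "HOND", "Chien"),
                    stringsAsFactors = FALSE)

# Lower-case everything except the German words:
not_german = toy.df$language != "German"
toy.df$lang_word[not_german] = tolower(toy.df$lang_word[not_german])
# toy.df$lang_word is now c("Hund", "hond", "chien")
```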

To define regions for each language, I used administrative boundaries. In many cases these are national boundaries (Norway for Norwegian, etc.), for which I used the world map shapefile at Thematic Mapping. When working with smaller divisions of a country, I used the shapefiles from DIVA-GIS. The polygons in the latter set of files are much more detailed than in the former, and when merging the two there were often gaps, which I edited manually. I've made my shapefiles available for download, but while they're good enough for the cognate class maps, they're not great, and the boundaries don't always line up.

Language regions were based on what I could gather from Wikipedia articles (see map). I hope I haven't mangled things too badly. I didn't find a place for Lahnda; if a Wikipedia article had shown where it was spoken then I might have carved out a piece in the west of Punjab Province for it, but in the end I left it in the tables and omitted it from the maps. Urdu, with around a hundred million speakers, got consigned to a couple of provinces in the north of Pakistan (making room for Punjabi, Balochi, Sindhi, and Pashto), but at least it's distinctive in the large maps by being written in the Persian alphabet.

If anyone wants a detailed list of which states/provinces/districts/cantons got assigned to which language, feel free to email me at dw.barry@gmail.com, or study the shapefile in the download below.

Data and code

The data and code can be downloaded here (8.3 MB). I hope I've put everything in there that needs to be there. I've included a long_list file, which is the full set of cognate classes from IELex (including all the extinct languages, etc.), as well as short_list_final, which is what I used for the maps.

While I used QGIS for most of the shapefile editing, I used R to create the maps, to make it easier to get the same map window each time. My usual plotting package in R is ggplot2, but it is very slow on my computer when given a large shapefile to plot. I therefore created some "template" images, showing the regions for each language, their boundaries, and the land and water, with a single ggplot run. For the main map creation, I loaded those templates using the EBImage package and edited the pixels directly, writing back to file (the words are still ggplotted; since they only exist at a small number of points, that part is not so slow). It still took about four hours to generate the 207 large maps, but I don't care to speculate how long it would have taken if I'd ggplotted the polygons for every map separately.

R code follows (and is also in the download file). Firstly, creating the land and water template image. (I'm separating these scripts out because that's how they're organised on my computer; each one loads the same set of packages, so in any given script I may well not need one or more of them.)

library(rgdal)
library(maptools)
library(sp)
library(ggplot2)
library(grid)
library(plyr)

setwd("C:/IELex/")

world_poly = readOGR("shapefiles", "land_for_IE_2", stringsAsFactors=FALSE)
world_poly = world_poly[ , which(names(world_poly)=="NAME")]

# Stuff needed to ggplot the polygons:
world_poly@data$id = rownames(world_poly@data)
world.points = fortify(world_poly)  # the id column defaults to the polygon rownames
world.df = join(world.points, world_poly@data, by="id")

map_plot = ggplot() + geom_polygon(data=world.points, aes(x=long, y=lat, group=group, fill=1)) +
  coord_map(xlim=c(-26, 98), ylim=c(5, 72), projection="cylequalarea", parameters=40) + 
  theme(panel.background = element_rect(fill="#87ceeb")) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  guides(fill=FALSE) + scale_fill_gradientn(colours=c("#CCCCCC", "#CCCCCC"), space="rgb")

gt = ggplot_gtable(ggplot_build(map_plot))
ge = subset(gt$layout, name == "panel")

png(filename="land_background_large.png", height=1026, width=1500, units="px")
grid.draw(gt[ge$t:ge$b, ge$l:ge$r])
dev.off()

Secondly, the same for the language regions and boundaries.

library(rgdal)
library(maptools)
library(sp)
library(ggplot2)
library(grid)
library(plyr)

setwd("C:/IELex/")

IE_poly = readOGR("shapefiles", "IE_languages", stringsAsFactors=FALSE)
num_languages = length(IE_poly$Language)

# Stuff needed to ggplot the polygons:
IE_poly@data$id = rownames(IE_poly@data)
IE.points = fortify(IE_poly)  # the id column defaults to the polygon rownames
IE.df = join(IE.points, IE_poly@data, by="id")

# Greyscale fills from #040404 to #DCDCDC in steps of 4 (this used to be an
# embarrassing copy-paste from Excel); they match lang_pixels in the main
# map script:
grey_levels = seq(4, 220, by=4)
colour_vals = sprintf("#%02X%02X%02X", grey_levels, grey_levels, grey_levels)

# Zero-pad the polygon ids so that the factor levels sort in the same
# order as colour_vals:
IE.points$fill_factor = IE.points$id
single_digits = which(nchar(IE.points$fill_factor)==1)
IE.points$fill_factor[single_digits] = sprintf("0%s", IE.points$fill_factor[single_digits])

map_plot = ggplot() + geom_polygon(data=IE.points, aes(x=long, y=lat, group=group, fill=factor(fill_factor))) +
  coord_map(xlim=c(-26, 98), ylim=c(5, 72), projection="cylequalarea", parameters=40) + 
  theme(panel.background = element_blank()) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  guides(fill=FALSE) + scale_fill_manual(values=colour_vals)


gt = ggplot_gtable(ggplot_build(map_plot))
ge = subset(gt$layout, name == "panel")

png(filename="ie_lang_areas_large.png", height=1026, width=1500, units="px")
grid.draw(gt[ge$t:ge$b, ge$l:ge$r])
dev.off()


# And now the language borders:

map_plot = ggplot() + geom_path(data=IE_poly, aes(x=long, y=lat, group=group), color="#808080") +
  coord_map(xlim=c(-26, 98), ylim=c(5, 72), projection="cylequalarea", parameters=40) + 
  theme(panel.background = element_blank()) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())


gt = ggplot_gtable(ggplot_build(map_plot))
ge = subset(gt$layout, name == "panel")

png(filename="ie_lang_boundaries_large.png", height=1026, width=1500, units="px")
grid.draw(gt[ge$t:ge$b, ge$l:ge$r])
dev.off()

And finally, the main map-creation script. Perhaps of interest is the way I try to keep the colour coding of language families consistent across maps, with enough success to satisfy me; much of the rest of the code is just me moving the words around so that they don't overlap.

library(ggplot2)
library(grid)
library(EBImage)
library(rgdal)

setwd("C:/IELex/")

languages.df = read.csv("language_codes.csv", sep=",", header=TRUE, stringsAsFactors=FALSE)
wordlist.df = read.csv("short_list_final.csv", sep=",", header=TRUE, stringsAsFactors=FALSE)

num_languages = length(languages.df$lang_me)

Encoding(languages.df$lang_ielex) = "UTF-8"
Encoding(wordlist.df$lang_word) = "UTF-8"

# Remove brackets (I think they mark borrowed words)
wordlist.df$cognate_class = gsub("\\(|\\)", "", wordlist.df$cognate_class)

# Remove entries with unclear cognate class
wordlist.df = wordlist.df[which(nchar(wordlist.df$cognate_class) < 3), ]

# Replace the IELex language names with the ones I use.
for (ct in 1:num_languages) {
  wordlist.df$language = gsub(languages.df$lang_ielex[ct], languages.df$lang_me[ct], wordlist.df$language)
}

# Fix Luxembourgish separately - the UTF8 character doesn't get through the Regex.
wordlist.df$language[which(grepl("buergesch", wordlist.df$language))] = "Luxembourgish"

# Fix Scots Gaelic separately - the UTF8 character doesn't get through the Regex.
wordlist.df$language[which(grepl("Gaelic", wordlist.df$language))] = "Scottish Gaelic"

# Remove Lahnda:
wordlist.df = wordlist.df[-which(wordlist.df$language == "Lahnda"), ]


# Need this ordering of the languages to make sure the colours match:
IE_poly = readOGR("shapefiles", "IE_languages", stringsAsFactors=FALSE)
ordered_langs = IE_poly$Language

# If we print every language's word, need to know where to print:
label_points = coordinates(IE_poly)
x_label = label_points[ , 1]
y_label = label_points[ , 2]
label_text = rep("", length(x_label))

labels.df = data.frame(x_label, y_label, label_text, stringsAsFactors=FALSE)

# Try to stop the labels overlapping.
labels.df$hjust = 0.5
labels.df$vjust = 0.5
labels.df$angle = 0

right_align = c("Irish", "Portuguese", "Friulian", "Albanian", "Greek", "Bhojpuri", "Danish", "Assamese",
                "Welsh")
left_align = c("Magadhi", "Slovenian", "Russian", "Macedonian", "Romanian")

labels.df$hjust[which(IE_poly$Language %in% right_align)] = 1
labels.df$hjust[which(IE_poly$Language %in% left_align)] = 0


labels.df$y_label[which(IE_poly$Language == "French")] = 48
labels.df$y_label[which(IE_poly$Language == "English")] = 54
labels.df$x_label[which(IE_poly$Language == "Welsh")] = -6
labels.df$y_label[which(IE_poly$Language == "Welsh")] = 50.5
labels.df$x_label[which(IE_poly$Language == "Irish")] = -10.6
labels.df$x_label[which(IE_poly$Language == "Norwegian")] = 8
labels.df$y_label[which(IE_poly$Language == "Norwegian")] = 60.6
labels.df$x_label[which(IE_poly$Language == "Russian")] = 38
labels.df$y_label[which(IE_poly$Language == "Russian")] = 59
labels.df$y_label[which(IE_poly$Language == "German")] = 53.5
labels.df$y_label[which(IE_poly$Language == "Latvian")] = 58
labels.df$y_label[which(IE_poly$Language == "Slovak")] = 48
labels.df$y_label[which(IE_poly$Language == "Portuguese")] = 38
labels.df$y_label[which(IE_poly$Language == "Bulgarian")] = 43
labels.df$y_label[which(IE_poly$Language == "Serbo-Croatian")] = 44.5
labels.df$y_label[which(IE_poly$Language == "Scottish Gaelic")] = 60
labels.df$y_label[which(IE_poly$Language == "Hindi")] = 28
labels.df$y_label[which(IE_poly$Language == "Maithili")] = 26.7
labels.df$y_label[which(IE_poly$Language == "Nepali")] = 30
labels.df$x_label[which(IE_poly$Language == "Assamese")] = 96
labels.df$y_label[which(IE_poly$Language == "Assamese")] = 28

# Words that are too long in Irish to fit in the plot:
irish_60 = c("laugh", "live", "rain")


# Language families for colouring purposes:
north_germanic = c("Swedish", "Danish", "Norwegian", "Icelandic")
west_germanic = c("German", "Dutch", "English", "Luxembourgish")
romance = c("Portuguese", "Spanish", "Catalan", "French", "Friulian", "Italian", "Sardinian", "Romanian")
celtic = c("Welsh", "Irish", "Scottish Gaelic")
baltoslavic = c("Belarusian", "Russian", "Ukrainian", "Czech", "Slovak", "Polish", "Bulgarian", "Macedonian", "Serbo-Croatian", "Slovenian", "Latvian", "Lithuanian")
iranian = c("Zazaki", "Kurdish", "Pashto", "Balochi", "Persian", "Tajik")
indoaryan = c("Hindi", "Urdu", "Bengali", "Punjabi", "Marathi", "Gujarati", "Bhojpuri", "Oriya", "Sindhi", "Sinhala", "Nepali", "Assamese", "Maithili", "Magadhi", "Kashmiri")
other_langs = c("Albanian", "Armenian", "Greek")

language_families = list(indoaryan, romance, iranian, baltoslavic, west_germanic, north_germanic, celtic, other_langs)

# Need to know this ahead of time; I use a power of 2 to make my arithmetic easier:
colour_scale_classes = 32

# Colour wheel:
colours = col2rgb(hcl(h=seq(15, 375, length=(colour_scale_classes+1)), l=65, c=100)[1:colour_scale_classes])

# Indices of columns in the colour matrix for each language family:
indoaryan_i = c(17, 27)
romance_i = c(1, 7)
iranian_i = c(9, 19)
baltoslavic_i = c(25, 3)
west_germanic_i = c(13, 23)
north_germanic_i = c(21, 31)
celtic_i = c(29, 11)
other_langs_i = c(5, 15)

family_colours.df = data.frame(indoaryan_i, romance_i, iranian_i, baltoslavic_i, west_germanic_i, north_germanic_i, celtic_i, other_langs_i)


image_type = "large"

# Change parameters based on the image type.  I originally allowed for a Europe-only or Asia-only map,
# which would necessitate different map_limits, but didn't pursue this.

if (image_type == "small") {
  
  map_limits_x = c(-26, 98)
  map_limits_y = c(5, 72)
  pixels_x = 500
  pixels_y = 342
  output_folder = "images_small"
  background_suffix = ""
  
} else if (image_type == "large") {
  
  map_limits_x = c(-26, 98)
  map_limits_y = c(5, 72)
  pixels_x = 1500
  pixels_y = 1026
  output_folder = "images_large"
  background_suffix = "_large"
  
} else {
  stop("Bad image type")
}


land_background_file = sprintf("land_background%s.png", background_suffix)
lang_boundaries_file = sprintf("ie_lang_boundaries%s.png", background_suffix)
lang_locations_file = sprintf("ie_lang_areas%s.png", background_suffix)

land_background_img = readImage(land_background_file)

lang_boundaries_img = readImage(lang_boundaries_file)
lang_boundaries_mat = round(as.matrix(lang_boundaries_img[ , , 1])*255)
boundary_pixels = (lang_boundaries_mat < 240)  # Should be 128 or 127

lang_location_img = readImage(lang_locations_file)
lang_location_mat = round(as.matrix(lang_location_img[ , , 1])*255)

# The pixel-coding of the languages in the ie_lang_areas.png map:
lang_pixels = seq(4, 220, by=4)


list_words = unique(wordlist.df$word)

# Main loop:
for (word in list_words) {
  labels.df$label_text = ""
  Encoding(labels.df$label_text) = "UTF-8"
  
  new_map = round(land_background_img * 255)
  new_map_r = new_map[ , , 1]
  new_map_g = new_map[ , , 2]
  new_map_b = new_map[ , , 3]
  
  
  word.df = wordlist.df[which(wordlist.df$word == word), ]
  
  num_languages_defined = length(word.df$language)
  
  # We want to have the language colours fairly consistent wherever possible, based on the
  # language families defined earlier.
  
  classes = unique(word.df$cognate_class)
  num_classes = length(classes)
  
  class_counts = sapply(classes, function(x) sum(word.df$cognate_class == x))
  classes = classes[order(class_counts, decreasing=TRUE)]
  
  class_colour_i = 0*class_counts
  
  # Initialise a counter for the number of classes each family has been 
  # assigned to:
  family_counter = sapply(language_families, function(x) 0)
  
  # Vector to store classes that will have random colour columns assigned:
  skipped_classes = numeric(0)
  
  # Try to associate a colour index with each cognate class
  
  for (ct in 1:num_classes) {
    these_langs = word.df$language[which(word.df$cognate_class == classes[ct])]
    family_counts = sapply(language_families, function(x) length(which(x %in% these_langs)))
    this_family = which(family_counts == max(family_counts))[1]
    
    family_counter[this_family] = family_counter[this_family] + 1
    
    if (family_counter[this_family] > 2) {
      # Skip on first run; assign random colour class of those remaining later.
      skipped_classes = c(skipped_classes, ct)
    } else {
      class_colour_i[ct] = family_colours.df[family_counter[this_family], this_family]
    }
  }
  
  # Did we miss any colours?
  num_skipped = length(skipped_classes)
  if (num_skipped > 0) {
    for (ct in skipped_classes) {
      remaining_colours_i = setdiff(1:colour_scale_classes, class_colour_i)
      class_colour_i[ct] = sample(remaining_colours_i, 1)
    }
  }
  
  for (ct in 1:num_languages_defined) {
    this_class = word.df$cognate_class[ct]
    class_index = which(classes==this_class)
    colour_index = class_colour_i[class_index]
    
    this_ordered_lang = which(ordered_langs == word.df$language[ct])
    
    labels.df$label_text[this_ordered_lang] = word.df$lang_word[ct]
    
    if (!is.na(this_class)) {
      # Replace the relevant pixels in the ie_lang_areas.png template with
      # the colour of this word's cognate class:
      this_pixels = (lang_location_mat == lang_pixels[this_ordered_lang])*1
      new_map_r = new_map_r + this_pixels*(colours[1, colour_index] - new_map_r)
      new_map_g = new_map_g + this_pixels*(colours[2, colour_index] - new_map_g)
      new_map_b = new_map_b + this_pixels*(colours[3, colour_index] - new_map_b)
    }
  }
    
  # Add language boundaries:
  new_map_r = new_map_r + boundary_pixels*(128 - new_map_r)
  new_map_g = new_map_g + boundary_pixels*(128 - new_map_g)
  new_map_b = new_map_b + boundary_pixels*(128 - new_map_b)
  
  # Add the word(s):
  
  # Some irregular cases where the words are long and I have to shuffle things around....
  orig_labels.df = labels.df
  
  if (word == "because") {
    labels.df$y_label[which(IE_poly$Language == "Hindi")] = 27
  }
  
  if (word %in% c("breathe", "rain")) {
    labels.df$hjust[which(IE_poly$Language == "Kashmiri")] = 0
  }


  if (word == "hunt") {
    labels.df$hjust[which(IE_poly$Language == "Balochi")] = 1
    labels.df$hjust[which(IE_poly$Language == "Kashmiri")] = 0
    labels.df$x_label[which(IE_poly$Language == "Irish")] = -10.1
  }
  
  if (word == "guts") {
    labels.df$hjust[which(IE_poly$Language == "Sardinian")] = 0.6
  }
  
  if (word %in% c("kill", "swim", "throw")) {
    labels.df$hjust[which(IE_poly$Language == "Czech")] = 0
  }
  
  if (word %in% c("rightside", "tooth")) {
    labels.df$y_label[which(IE_poly$Language == "Romanian")] = 47.3
    labels.df$hjust[which(IE_poly$Language == "Romanian")] = 0.3
  }
  
  if (word %in% c("split", "squeeze", "stab", "swell")) {
    labels.df$hjust[which(IE_poly$Language == "Czech")] = 0.3
  }
  
  if (word == "stand") {
    labels.df$hjust[which(IE_poly$Language == "Kashmiri")] = 0.3
    labels.df$hjust[which(IE_poly$Language == "Italian")] = 0.7
    labels.df$vjust[which(IE_poly$Language == "Sardinian")] = 1
  }
  
  if (word %in% irish_60) {
    labels.df$angle[which(IE_poly$Language == "Irish")] = 60
    labels.df$angle[which(IE_poly$Language == "Welsh")] = 60
  }

  
  if (image_type == "small") {
    word_plot = ggplot() + geom_text(aes(x=13, y=24, label=word), size=16) +
      coord_map(xlim=map_limits_x, ylim=map_limits_y, projection="cylequalarea", parameters=40) + 
      theme(panel.background = element_blank()) +
      theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
  } else {
    word_plot = ggplot() + geom_text(data=labels.df, aes(x=x_label, y=y_label, label=label_text,
                                                         hjust=hjust, vjust=vjust, angle=angle), size=10) +
      coord_map(xlim=map_limits_x, ylim=map_limits_y, projection="cylequalarea", parameters=40) + 
      theme(panel.background = element_blank()) +
      theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
  }
  
  gt = ggplot_gtable(ggplot_build(word_plot))
  ge = subset(gt$layout, name == "panel")
  
  png(filename="temp_word.png", height=pixels_y, width=pixels_x, units="px")
  grid.draw(gt[ge$t:ge$b, ge$l:ge$r])
  dev.off()
  
    
  new_img = array(data=0, dim=dim(lang_location_img))
  new_img[ , , 1] = new_map_r
  new_img[ , , 2] = new_map_g
  new_img[ , , 3] = new_map_b
  
  new_img = new_img / 255
  
  word_img = readImage("temp_word.png")
  new_img = new_img*word_img
  
  out_file = sprintf("%s/%s.png", output_folder, word)
  
  new_img = Image(new_img, colormode="Color")
  writeImage(new_img, file=out_file)
  
  # Undo any shuffling of the label positions:
  labels.df = orig_labels.df
}
