The Vowel Harmony Calculator

About the Calculator

The harmony calculator on this site is a tool for quantifying vowel co-occurrence in a given corpus. The unconditioned harmony calculator requires you to specify two classes of vowels for your corpus. A polysyllabic word is harmonic if all of its vowels belong to the same class; monosyllabic words are ignored. The script determines what percentage of words in the corpus are harmonic, and also calculates the extent to which that percentage exceeds random chance (the harmony index). The script can also account for neutral vowels, which are vowels that do not participate in the harmony system.
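The classification described above can be sketched in a few lines of Python. This is an illustration only, not the calculator's actual source: the function names are hypothetical, and it assumes that each vowel symbol in a word counts as one syllable.

```python
def vowels_of(word, class_a, class_b):
    """Return the vowel symbols of `word`, in order; all other characters are skipped."""
    return [ch for ch in word if ch in class_a or ch in class_b]


def is_harmonic(word, class_a, class_b):
    """True or False for polysyllabic words; None for words that are ignored."""
    vowels = vowels_of(word, class_a, class_b)
    if len(vowels) < 2:
        # Assumption: one vowel symbol per syllable, so this word is monosyllabic.
        return None
    return all(v in class_a for v in vowels) or all(v in class_b for v in vowels)


def percent_harmonic(words, class_a, class_b):
    """Percentage of polysyllabic words whose vowels all come from one class."""
    verdicts = [is_harmonic(w, class_a, class_b) for w in words]
    counted = [v for v in verdicts if v is not None]
    if not counted:
        return 0.0
    return 100.0 * sum(counted) / len(counted)
```

With the classes "aou" and "ei", the word "evler" is harmonic, "kitap" is disharmonic, and "at" is ignored because it has only one vowel.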

Harmony threshold and harmony index

In order to determine if a certain percentage of harmonic words is evidence of a vowel harmony system, the calculator must determine what percentage of words would be harmonic by chance alone. For example, if 90% of all the vowels in a corpus are in one class and only 10% are in the other, one expects to find many words that contain only vowels from the first class. This is a phenomenon we refer to as "class skewing."

To address this issue, the script calculates a unique harmony threshold for each corpus. This number represents the percentage of words expected to be harmonic purely by chance, taking into account (1) the vowel distribution of the corpus and (2) the average syllable count of polysyllabic words. The threshold is used to produce the harmony index, which is the percentage of harmonic words minus the harmony threshold.
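The exact formula the script uses is not given here. As one plausible model, assume each vowel in a word is drawn independently, with probability p of falling in the first class and 1 - p of falling in the second. A word of n syllables is then harmonic by chance with probability p^n + (1 - p)^n. The Python sketch below applies that assumed model with n set to the mean syllable count of polysyllabic words; the function names are hypothetical.

```python
def harmony_threshold(p_class_a, mean_syllables):
    """Chance percentage of harmonic words under an ASSUMED independence model.

    p_class_a      -- proportion of corpus vowels in the first class (0 to 1)
    mean_syllables -- mean syllable count of polysyllabic words
    """
    p = p_class_a
    q = 1.0 - p
    # Probability that all vowels come from class A, plus the same for class B.
    return 100.0 * (p ** mean_syllables + q ** mean_syllables)


def harmony_index(percent_harmonic, threshold):
    """Percentage of harmonic words minus the harmony threshold."""
    return percent_harmonic - threshold
```

Under this model, a corpus with the 90%/10% skew from the example above and two-syllable words has a threshold of about 82%, so a corpus in which 95% of words are harmonic would have a harmony index of about 13. An evenly split corpus has a threshold of 50% for the same word length.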

Applications

The unconditioned harmony calculator was initially developed to measure Turkic backness harmony. It can also be used to measure systems such as Uralic backness harmony and Bantu height harmony.

How to Prepare Corpora

The calculator is designed to handle corpora consisting of ASCII text with one word per line. We recommend corpora of between 10,000 and 100,000 words. The script works with plain text files produced on Macintosh, Windows and *nix platforms (it automatically handles the differing line-break conventions). Corpora should be saved as "text only" from an editor like Notepad or SimpleText. Filenames should not contain any non-alphanumeric characters other than the dash and the dot.
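To check a corpus before uploading it, something like the following Python sketch may help. It is not the calculator's own code: the function names are hypothetical, and the line-break handling simply shows one way to accept Windows (CR LF), classic Macintosh (CR) and *nix (LF) files alike.

```python
import re

# Letters, digits, dash and dot only, as recommended for filenames.
FILENAME_OK = re.compile(r"[A-Za-z0-9.-]+")


def filename_is_valid(name):
    """True if `name` contains only alphanumerics, dashes and dots."""
    return FILENAME_OK.fullmatch(name) is not None


def parse_corpus(raw_bytes):
    """Decode an ASCII corpus and return its words, one per non-empty line."""
    text = raw_bytes.decode("ascii")  # raises UnicodeDecodeError on non-ASCII input
    # Normalize CR LF and bare CR line breaks to LF before splitting.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return [line.strip() for line in text.split("\n") if line.strip()]
```

For a file on disk, the bytes can be read with open(path, "rb").read() and passed to parse_corpus.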

Non-ASCII characters

In order to deal with vowel symbols not found in the standard ASCII character set, our convention is to represent all ASCII vowels with their lowercase symbol and vowels not in ASCII with specially assigned uppercase vowel symbols. The script does not deal with consonants; only characters that are specified as vowels are taken into account.
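As an illustration of this convention, the Python sketch below converts words to the ASCII-only form. The particular symbol assignments (dotless i to "I", o with diaeresis to "O", u with diaeresis to "U") are hypothetical choices for a Turkish corpus, not assignments prescribed by the calculator.

```python
# Hypothetical symbol assignments; choose your own for each corpus.
ASSIGNED = str.maketrans({
    "\u0131": "I",  # dotless i
    "\u00f6": "O",  # o with diaeresis
    "\u00fc": "U",  # u with diaeresis
})


def to_ascii_convention(word):
    """Lowercase the word, then replace non-ASCII vowels with uppercase symbols."""
    # Lowercasing first keeps uppercase letters free for the assigned symbols.
    # Note: str.lower() is not locale-aware, so a Turkish capital "I" becomes "i".
    return word.lower().translate(ASSIGNED)
```

With these assignments, the Turkish word for "glasses" becomes "gOzlUk" and the word for "red" becomes "kIzIl". The vowel classes given to the calculator would then list "I", "O" and "U" alongside the lowercase ASCII vowels.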

Long vowels and diphthongs

Some languages have long vowels. Using two vowel symbols to denote a long vowel in a corpus would cause the calculator to count it as two syllables, which would produce invalid results. To deal with this, our convention is to replace each long vowel with its initial symbol followed by a colon (e.g. "azuuda" becomes "azu:da"). We then run the script with the "distinguish long and short vowels" option checked and specify ":" as the symbol that denotes long vowels. In addition, the "reduce diphthongs" option can be used to reduce any string of two vowels to its first vowel.
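The corpus preparation described above can be sketched in Python as follows. The function names are hypothetical, and the second function only imitates the stated effect of the "reduce diphthongs" option; how the script itself implements that option is not shown here.

```python
import re


def mark_long_vowels(word, vowels, marker=":"):
    """Replace a doubled vowel with the single vowel plus a length marker."""
    pattern = "([" + re.escape(vowels) + r"])\1"  # the same vowel twice in a row
    return re.sub(pattern, lambda m: m.group(1) + marker, word)


def reduce_diphthongs(word, vowels):
    """Reduce any string of two vowels to its first vowel."""
    vowel_class = "[" + re.escape(vowels) + "]"
    return re.sub("(" + vowel_class + ")" + vowel_class, r"\1", word)
```

For example, mark_long_vowels("azuuda", "aeiou") gives "azu:da", and reduce_diphthongs("taivas", "aeiou") gives "tavas". Long vowels should be marked first, since a doubled vowel would otherwise be reduced like a diphthong.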

Neutral vowels

Some harmony systems include "transparent" or "neutral" vowels, which are not part of the harmony system. These vowels should be indicated by checking the "language has neutral vowels" option and specifying which symbols denote neutral vowels.
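One way to picture the effect of this option is the Python sketch below. It rests on assumptions that are not confirmed by the calculator: neutral vowels still count as syllables, they are skipped when checking class membership, and a word containing only neutral vowels is treated as harmonic. The function name is hypothetical.

```python
def is_harmonic_with_neutrals(word, class_a, class_b, neutral):
    """True or False for polysyllabic words; None for monosyllabic words."""
    syllables = [ch for ch in word
                 if ch in class_a or ch in class_b or ch in neutral]
    if len(syllables) < 2:
        return None  # monosyllabic words are ignored
    # Assumption: neutral vowels are transparent, so only the others are compared.
    active = [v for v in syllables if v not in neutral]
    # Assumption: if `active` is empty, the word counts as harmonic.
    return all(v in class_a for v in active) or all(v in class_b for v in active)
```

Taking Finnish as an example, with back vowels "aou", front vowels "AOy" (using uppercase symbols for the non-ASCII vowels) and neutral vowels "ei", the word "tuoli" is harmonic because its neutral "i" is skipped, while "olympia" is disharmonic because it mixes "o" and "a" with "y".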

Results

The unconditioned harmony calculator produces the following results:

mean syllable count
mean syllable count in polysyllabic words
harmony threshold
percentage of harmonic words
harmony index (percentage of harmonic words minus harmony threshold)
harmony threshold in the first two syllables
percentage of harmonic words considering only the first two syllables
percentage of vowels in each class (skewing)

It also produces three logs:

a harmony log, containing details on harmonic word distribution
a disharmony log, listing all disharmonic words
a frequency log, showing vowel frequency and symbol co-occurrence tables