User:Maik/Thoughts on an Official Dictionary

Revision as of 16:30, 18 April 2020 by Robertbaruch (talk | contribs) (→‎The structure of a multilingual wiki-dictionary for Lojban: Specifies ISO 639-2 in keeping with the example entry down below.)

Here are some of my thoughts on the LFK's goal of creating an Official Dictionary of the Lojban Language (ODLL) [working name; final name yet to be decided].

In this essay I sketch a rough, tentative road map. The plan aims high, but is intended to be implemented in phases. (Rome was not built in a day, but it was built, one stone at a time.) This essay is a work in progress; I may edit or enlarge it as time goes on.

I strongly encourage feedback, which you may place on the discussion page, or send to the Lojban list.

Intro: The primary "use cases" of the dictionary

"Use case" is software engineering jargon for a general scenario in which a piece of software is expected to be used and to perform a certain task. Use cases are analyzed in order to help clarify the requirements that a piece of software should be designed to meet in order to be counted as successful.

Borrowing the concept and applying it to this project, what are the important "use cases" for an official Lojban dictionary? In my opinion, one primary use case involves a person trying to write a couple of paragraphs in Lojban. We don't know if the writing is an original text or a translation. We don't know the topic. We don't know if the writer is a beginner or an expert at Lojban. But we still know fairly well what requirements we want the dictionary to meet: We want the writer to be able to find the words needed and to understand how to use them correctly without a lot of difficulty. Similarly, a second primary use case involves a person trying to comprehend or translate what the first person has written. Likewise, we know the requirements: The second person should be able to find the words used and grasp their meaning.

These use cases may seem pretty obvious, but I think it's helpful to state them clearly. Other use cases are conceivable, but in my opinion, these are the primary ones we should bear in mind as we decide how to build the dictionary and then build it.

Wiktionary as a model

I am inspired by Wiktionary. Though the effort is hardly complete, for certain editions such as the ones in English (en.wiktionary.org) and a few other languages (e.g. French, fr.wiktionary.org), it usually does a pretty good job for what it was intended to do: provide clear, practical descriptions of words from a wide range of languages for people who understand the native language of that edition. Each edition is intended to eventually contain an entry, in its native language, for virtually every word of every language (considered important by Wiktionary) in the world: some living, some extinct, and even a few constructed.

I feel there is a lot we can learn from Wiktionary.

Side note: Wiktionary actually has a Lojban edition, but unfortunately Lojban words are currently not being covered in the non-Lojban editions. (Whether this exclusion can or should be addressed is not something that directly concerns this essay, but I thought I'd mention it.)

The structure of a multilingual wiki-dictionary for Lojban

Wiktionary is composed of multiple editions, each a separate wiki in its own right existing in a wiki "family", one for each metalanguage (i.e. the "native" language of that edition -- English, French, whatever). Within each of these editions, Wiktionary gives a page for each distinct written word. Every object language (i.e. both "native" and "foreign") that contains that written word, regardless of the pronunciation and meaning in that language, gets an entry for that word on that page. For an example, see the English edition's page for "cat" which has an entry for English "cat" (the English entry, if any, will always be on top of the English edition page), Indonesian "cat" which means paint, Irish "cat" which means cat, Malay "cat", and so on.

Since all the entries in the ODLL will be either to Lojban or from Lojban, we can and should modify this structure somewhat. If we simply followed the Wiktionary approach, then in the English edition most pages would have a single entry, since few English and Lojban words are written identically, but some words like glare and be would have exactly two entries: one from Lojban to English, and one from English to Lojban. I think that we should simply have one entry per page. The easiest way to implement a one-entry-per-page format would be to divide the English edition into two editions, one for each direction. I have not set up a MediaWiki server yet, but I believe this is doable. The wikis would be identified in their URLs by the formulas jbo-X (Lojban to X) and X-jbo (X to Lojban), where X is an ISO 639-2 language code (eng for English, fra for French, etc.). When a language has both a bibliographic code (based on the English word for the language) and a terminology code (based on the native word for the language), the terminology code is used.

So in effect the ODLL would be a wiki family that ultimately has an odd number of editions: two for each non-Lojban language we eventually cover (assuming we get volunteers with the time and the right language skills), plus one for Lojban-to-Lojban definitions, which may eventually be the most important edition of all.
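As a rough illustration of the naming scheme, here is a minimal Python sketch. Only the jbo-X / X-jbo pattern and the use of ISO 639-2 codes come from the plan above; the function name edition_ids, the small lookup table, and the identifier jbo-jbo for the Lojban-to-Lojban edition are assumptions made for the example.

```python
# Hypothetical helper: the essay fixes only the "jbo-X" / "X-jbo" pattern.
# A few ISO 639-2 codes. Where a language has both a bibliographic and a
# terminology code (French: fre / fra, German: ger / deu), the terminology
# code is the one listed here.
ISO_639_2_T = {"English": "eng", "French": "fra", "German": "deu"}


def edition_ids(languages):
    """Return the wiki identifiers for the given metalanguages."""
    ids = ["jbo-jbo"]  # assumed identifier for the Lojban-to-Lojban edition
    for name in languages:
        code = ISO_639_2_T[name]
        ids.append(f"jbo-{code}")  # Lojban to X
        ids.append(f"{code}-jbo")  # X to Lojban
    return ids


editions = edition_ids(["English", "French"])
print(editions)
print(len(editions))  # two per language plus one, so always odd
```

With n non-Lojban languages the family has 2n + 1 editions, which is why the count is always odd.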

As a minor note, the jbo-X editions in various languages can be usefully furnished with interlanguage wiki links in the manner of Wiktionary and Wikipedia, so a person reading the definition for a Lojban word in one language can quickly take a look at the definition for that Lojban word in another language, particularly English, since the English definitions will probably be the best and most complete for a long time. There is little need to interlink the X-jbo editions in the same way, since different languages only sporadically share a written word with a similar meaning.

First phase: English edition, core words first

The structure and potential for a multilingual dictionary will be there from the start, but realistically the first phase of the project will be to create an English edition, and I will say little about non-English editions until later.

The parts of the English phase will probably look something like this:

  • Creating Lojban-to-English entries for official cmavo in active use -- these must be good-quality
  • Creating Lojban-to-English entries for official gismu -- these must be good-quality
  • Creating Lojban-to-English entries for important and useful lujvo and fu'ivla and possibly cmevla -- these should be fair-quality
  • Creating English-to-Lojban entries for the English words used in the above areas
  • Creating a printable jbo->en/en->jbo dictionary suitable for publication

The items above are ordered by priority; in general, cmavo and gismu come before other kinds of words and should receive more care, and among cmavo some selma'o will receive more care than others.

The dictionary team could strive to identify a list of "core words" -- perhaps by scouring various lists of common words, or by translating the words that cover, say, 95% or 98% (or whatever percentage) of running text in English and Lojban's other source languages. I am not sure how to do this efficiently, or whether it can be. But part of the job of the dictionary team should be to decide what the core words are, and tag them using wiki page categories. These words would be the ones that are sure to be included in the printed dictionary. They would also be the first target of non-English dictionary editions.
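To make the coverage idea concrete, here is a small Python sketch of one way it might be done: count word frequencies in a text sample and keep the most frequent words until they account for the target share of the running text. The function name core_words and the toy sentence are invented for illustration; a real attempt would need large corpora for each source language, and the resulting words would still have to be translated and reviewed by the team.

```python
from collections import Counter


def core_words(tokens, target=0.95):
    """Return the most frequent words that together reach `target` coverage."""
    counts = Counter(tokens)
    total = sum(counts.values())
    chosen, covered = [], 0
    # Most frequent first; ties broken alphabetically so the result is stable.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered / total >= target:
            break
        chosen.append(word)
        covered += count
    return chosen


# Toy sample (made up): ten running words of English.
sample = "the cat saw the dog the cat ran the end".split()
print(core_words(sample, target=0.6))
```

In this toy sample two words already cover 60% of the text, which hints at why a fairly short core list can go a long way.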

Getting started

Tentative first steps:

  1. Perhaps the first step in all of this is to make a superlist of all existing data and ideas regarding the lexicon (see #Resources below for a small start), for reference purposes and in some cases to seed the wiki.
  2. The tech subteam (i.e. probably me and User:Aris) works out the data schema and sets up a development Wiki. The tech subteam will probably hash in a lot of data to have a starting point.
  3. I suggest we start with cmavo and ignore gismu for a little while. I want to start with non-controversial selma'o first. That will give us a chance to figure out how to operate on the Wiki and how we want to format things before we wade into deeper waters.
  4. We should attack words one related batch at a time.
  5. We should make metapages on the Wiki for each batch as need be, or use spreadsheets as Ilmen suggests.

More than definitions

Since Wiki-based dictionaries are not limited by space, they can and do contain a lot of information that it's not practical to include in a printed dictionary: etymology (including gismu source words and cmavo mnemonics), examples, usage notes, links to related words such as synonyms, antonyms, coordinate terms, hypernyms, hyponyms, etc. Perhaps not all of this work can be done at once, but it can be done gradually over time.

Of course, we must start with the definitions, but adding at least one example for each word used in a particular way should be considered a priority, especially for frequent and basic words. A few helpful pointers will go a long way to aiding the dictionary user.

Collaboration and standards

It's hard to predict how people will behave, but here are a few thoughts on how I hope the team would collaborate.

The Dictionary Team will seek volunteers willing to collaborate toward the common goal. Under one possible healthy regime, it may be that each team member takes responsibility for a different aspect of creating and refining content. Person A finds good material and "plops it down" on a wiki page. Person B takes the data and formats it more neatly, placing the core content in the database. Person C writes an example sentence. Person D adds links to other words. And so on. Alternately, perhaps there is a good deal of duplication of the various kinds of work, and different people do all the work on totally separate words. It is hard to say in advance how the work will be divided, just that the overall hope is collaboration and a good result.

Regardless of how it's divided, all content is the responsibility of the Team as a whole. No single editor, not even a word's coiner, is the sole authority on a word. All contributions may be "mercilessly edited", to use MediaWiki's phrase.

As discussed in the next section, the Team will devise and impose a data structure for some portion of the content. It will also adopt wiki formatting conventions and quality standards of various sorts dealing with creating and displaying content. If it seems advantageous to do so, the team may appoint a Chief Editor to oversee quality, or a Special Editor to expedite the printed version or some other goal of the project.

Comments can be added and objections can be raised by anyone to specific edits on the discussion pages. The Team is expected to solve disputes by seeking consensus. As a last resort, the Team may opt to solve a dispute by vote.

It's important that the content of the dictionary is consistent with other official Lojban materials, such as the CLL. Therefore, all definitions are ultimately subject to review by the LFK and may be altered if the LFK discovers an inconsistency.

Data structure and support for other apps

Wiktionary's content is classified as "semistructured" data. In order to support the printed version, as well as to support external apps and other uses, key fields of ODLL will be structured using the WikiLexeme module, so that it will be readily exportable.

Some definitions may have to be adjusted in such a way as to make them suitable for the printed dictionary, or possibly eventually for a beginners' word list or a textbook. It may eventually be useful to add special fields for these other materials.

The exact way all this is to be done is yet to be determined.
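Since the schema is yet to be designed, the following is only a sketch of what a structured, exportable entry might look like. All field names (headword, word_class, selmaho, core, examples) are hypothetical, and the sketch uses plain Python and JSON rather than any MediaWiki mechanism. The sample content is taken from Ilmen's jbo→eng example in the Resources section.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Example:
    lojban: str
    english: str


@dataclass
class Entry:
    headword: str
    word_class: str                 # "cmavo", "gismu", "lujvo", ...
    definition: str
    selmaho: Optional[str] = None   # only meaningful for cmavo
    core: bool = False              # tagged as a core word? (assumed field)
    examples: List[Example] = field(default_factory=list)


entry = Entry(
    headword="ma",
    word_class="cmavo",
    definition="what? OR who? OR where? (as a pronoun)",
    selmaho="KOhA",
    examples=[Example("do viska ma", "what are you seeing?")],
)

# asdict() converts nested dataclasses too, so the whole entry exports as JSON.
exported = json.dumps(asdict(entry), ensure_ascii=False, indent=2)
print(exported)
```

Keeping the key fields in a fixed shape like this is what would let the same data feed the wiki pages, the printed dictionary, and external apps.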

Develop dictionary content, not dictionary software

This may or may not be feasible, but to the greatest extent we can, I would like to avoid getting bogged down in writing, debugging, and maintaining code or employing "clever" solutions. If possible, everything done within the Dictionary Project will be accomplished with turnkey solutions from MediaWiki and stable extensions. Of course, MediaWiki needs to be properly learned and configured by the Team, and there is some sort of data schema yet to be designed. And some tasks, such as generating a printable PDF, may involve some coding. But generally I see coding stuff as a distraction from the main mission and something to be avoided except as a last resort.

In short, bearing our "primary use cases" in mind, I would prefer to aim to make this a lexicographical project, and not a computer science class assignment, and I hope it's possible to keep it that way.

The ODLL will be autonomous

In a similar vein, while a reasonable effort will be made to support external apps and other uses with exportable data, the Dictionary Project is hereby declared autonomous from all other projects and entities (save the LFK) and hereby disclaims any implied warranty of fitness for a particular purpose, or anything like that.

Jbovlaste: Comparison and relationship

Jbovlaste (JVS) is an established Lojban dictionary project that is set up differently from what is envisioned in this essay. In short, it is essentially set up to accept arbitrary new additions to the lexicon from independent contributors. Being decentralized and lacking editorial oversight, JVS in my view is not currently constituted to properly edit a dictionary containing core vocabulary. While many contributions are valuable additions to the Lojban lexicon, quite a few are useful only for joking around or unofficial experimentation (both of which are of marginal value from the perspective of our "primary use cases" -- see intro section). Moreover, the cmavo and gismu definitions so badly in need of elaboration and improvement are basically frozen under the pseudocontributor account "OfficialData".

In this light, I think the ODLL and JVS will end up in a complementary relationship -- The ODLL will be official, collaborative and centralized and will aim to produce good-quality definitions for core vocabulary such as basic cmavo, and JVS will be an unofficial haven for tinkerers and researchers largely acting independently and will continue to amass specialized, experimental and occasionally offbeat content. Depending on how radically the ODLL updates the core words, it may turn out that the JVS starts to become significantly out-of-date in that area, unless an effort is made to import more current data. As the ODLL reaches the stage of building up its repertoire of lujvo and fu'ivla (which will be borrowed most freely from JVS data whenever it makes sense to do so) and creating solid, practical entries for those, the JVS may conceivably begin to become out-of-date in that area as well.

The language change/reform question

This essay is focused on getting the language into a wiki-dictionary; it's not focused on getting reforms into the language. But the proposals for reform are out there and will certainly affect the work when the "rubber meets the road", so I will make some brief comments.

Regarding the word "change", which is sometimes conflated with "reform", I will mention, in order to be clear, that I do not consider the mere fleshing out of the skeletal definitions in cmavo.txt and gismu.txt to be "changes" in the sense of being reforms. Nor do I consider xorlo-related adjustments to these definitions, however substantial they may turn out to be, to be "changes" in this sense; xorlo has been official for years. I also do not consider clarifying points of confusion and removing contradictions to be "changes" in the sense of reform. These sorts of changes are not language reforms; they are just the natural and expected ironing out of wrinkles in the fabric of logic, consistency and practicality.

Genuine language-reform proposals obviously exist, and will have to be addressed. The Team and the LFK will simply have to figure out how to address them. That's beyond the scope of this essay, but at a guess I feel it will involve discussion, studying the history, reviewing usage, and consulting the Lojban community.

In order to get the Dictionary Project under way and moving forward, I suggest we keep lists of what words are "trouble-free", "moderately troublesome" and "very troublesome", or something like that. We will do cmavo first and then gismu. To build momentum, we create entries for the trouble-free cmavo first, then move on to the next category, and keep moving on from simpler to harder.
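The ordering suggested here (cmavo before gismu, and simpler before harder within each) can be sketched as a simple sort. The labels, the function name work_queue, and above all the trouble ratings attached to the sample words are placeholders invented for this example; they are not actual assessments of those words.

```python
# Assumed labels: the essay only proposes three rough trouble levels
# and that cmavo come before gismu.
CLASS_ORDER = {"cmavo": 0, "gismu": 1}
TROUBLE_ORDER = {
    "trouble-free": 0,
    "moderately troublesome": 1,
    "very troublesome": 2,
}


def work_queue(words):
    """Sort (word, word_class, trouble) tuples into working order."""
    return sorted(
        words,
        key=lambda w: (CLASS_ORDER[w[1]], TROUBLE_ORDER[w[2]], w[0]),
    )


# Placeholder ratings, made up purely to show the ordering.
sample = [
    ("klama", "gismu", "trouble-free"),
    ("ma", "cmavo", "moderately troublesome"),
    ("mei", "cmavo", "trouble-free"),
    ("penmi", "gismu", "very troublesome"),
]

queue = work_queue(sample)
for word, word_class, trouble in queue:
    print(word, word_class, trouble)
```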

Engaging the Lojban community

Efforts should be made to reach the Lojban community however we can in order to solicit input and reality-check our decisions. We could even try to set up a Dictionary Project Community Advisory Board or something like that, which would not be charged with any work other than giving opinions when asked questions. Alternately, perhaps it's easier just to visit #ckule and other channels and ask questions, and then document the answers. Or ask the questions on the Lojban-beginners list. I think it's imperative to make some effort to gauge the community.

A printable English/Lojban dictionary

As mentioned, a printable dictionary is a goal of phase I. I expect that the quality of this dictionary would be worthy of being assigned an ISBN and published as desired. The wiki-dictionary database will contain the data to be inserted into the body. The beginning and ending parts will need to be written up. We will want to learn and follow the typical professional conventions for a printed dictionary, whatever they are.

Since dictionaries are constantly being refined and enlarged, perhaps it will make sense to publish a new edition of the dictionary every few years.

Beyond English: A long-term plan

To be honest, nothing beyond the English edition is currently on the radar for me in any significant way, but the plan does provide for future phases.

Phase I: English

This includes all the work discussed so far culminating in both a jbo-en/en-jbo wiki dictionary and a printable version.

Phase II: Lojban-to-Lojban

Based on the work of Phase I, at some point Lojbanists should create a good-quality dictionary of Lojban in the Lojban language. This dictionary will allow motivated learners to bootstrap their way into using Lojban without knowing English.

As a long term goal, ideally, the in-Lojban dictionary would first become equal to the English dictionary, and then eventually perhaps become autonomous and supersede the English dictionary.

Phase III: Other languages

Once the English and Lojban dictionaries are in good shape, the project will be in the position to create dictionaries in other languages, using the careful work already done as a resource.

Conclusion: Spiral development and perseverance

We will do the best we can.

We will improve as we go along.

If we make a decision that we feel we need to change later, we will reserve the right to do so. Obviously it's not ideal to do a lot of flip-flopping, but I also don't think it's realistic to expect a linear trajectory toward perfection. I suspect it'll look more like a spiral trajectory toward some sort of satisfactory and stable condition.

If we persevere in making progress, the Project will eventually succeed.

Resources

You are free to update this section with additional resources. Please do not modify other sections but rather place your comments in a new topic section on the discussion page.

Lexicon and dictionary generally

  • dictionary: This page is a summary of information about the Lojban Dictionary: its history, current form, and future design. (historical overview compiled by la Gleki)
  • Ilmen's list of issues for the LFK A general list of stuff.
  • Ilmen's example of two-way dictionary entries:

※ ◆ introduces examples.

※ eng→jbo:

  • "where" /wɛə/ [interrogative adverb] 1. bu'u ma ◆ where did you meet him? → bu'u ma do ra pu penmi
  • "where" /wɛə/ [interrogative pronoun] 1. ma ◆ where did you go? → do pu klama ma ◆ where do you come from? → do klama fi ma
  • "where" /wɛə/ [relative pronoun] 1. (introducing restrictive information) → poi bu'u ke'a ◆ This is the town where I was born. → ti me lo tcadu poi bu'u ke'a mi jbena / ti tcadu je poi'i bu'u ke'a mi jbena ‖ 2. (introducing an incidental comment) → noi bu'u ke'a
  • "without" /wɪθˈaʊt/, /wɪðˈaʊt/ [preposition] 1. se pi'o nai ◆ eating without a fork → citka se pi'o nai lo forca ‖ 2. fau nai ‖ 3. ka'ai nai ……

※ jbo→eng:

  • ma [KOhA] what? OR who? OR where? (as a pronoun) ◆ do viska ma → what are you seeing? ◆ do pu penmi ma → who did you meet? ◆ do klama ma → where are you going? | ca ma → when? ◆ ca ma do cliva → when do you leave? | bu'u ma → where? | mu'i ma, ki'u ma → why? ◆ mu'i ma do pu cliva → why did you leave? | ⇒ mo
  • mei [MEI] -some — re mei → to be twosome, to be two ◆ mi'a ci mei → we are three | xo mei → how many? ……
  • ma [KOhA] what? | ca ma → when? | bu'u ma → where? ……
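The notation in Ilmen's eng→jbo entries is regular enough that it could probably be split into fields mechanically. The Python sketch below is only a guess at how that might look: it assumes the layout seen in the samples above (quoted headword, IPA between slashes, part of speech in brackets, senses separated by ‖, examples introduced by ◆ with → between source and translation), and the function name parse_eng_jbo and the output field names are invented here.

```python
import re

# Assumed layout: "headword" /ipa/ [part of speech] body
HEADER = re.compile(
    r'^"(?P<headword>[^"]+)"\s+(?P<ipa>/.+/)\s+\[(?P<pos>[^\]]+)\]\s+(?P<body>.+)$'
)


def parse_eng_jbo(line):
    """Split one eng→jbo entry line into headword, IPA, POS and senses."""
    m = HEADER.match(line.strip())
    if m is None:
        raise ValueError("unrecognized entry: " + line)
    senses = []
    for chunk in m["body"].split("‖"):
        # Drop the leading sense number ("1.", "2.", ...).
        chunk = re.sub(r"^\s*\d+\.\s*", "", chunk)
        gloss, *raw_examples = [part.strip() for part in chunk.split("◆")]
        examples = [
            tuple(side.strip() for side in example.split("→", 1))
            for example in raw_examples
        ]
        senses.append({"gloss": gloss, "examples": examples})
    return {
        "headword": m["headword"],
        "ipa": m["ipa"],
        "pos": m["pos"],
        "senses": senses,
    }


line = '''"where" /wɛə/ [interrogative adverb] 1. bu'u ma ◆ where did you meet him? → bu'u ma do ra pu penmi'''
parsed = parse_eng_jbo(line)
print(parsed)
```

The jbo→eng entries use further separators (| and ⇒) that this sketch does not attempt to handle.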

Cmavo

  • BPFK_Sections: Links to work done on cmavo definitions by the BPFK, which will be a key resource for the Project.

Gismu