Publishing the Liddell & Scott Lexicon via CITE


The LSJ Lexicon

A standard reference for any English speaker trying to read ancient Greek is the “LSJ,”:

A Greek-English Lexicon, Henry George Liddell, Robert Scott, revised and augmented throughout by Sir Henry Stuart Jones with the assistance of Roderick McKenzie (Oxford: Clarendon Press. 1940)

Thanks to The Perseus Project, The National Endowment for the Humanities, and many editors1 we long ago got this massive dictionary of ancient Greek into a machine-readable format. Thanks to Giuseppe Celano we have a version of those XML files with the (formerly) beta-code Greek transformed to Unicode.

The LSJ consists of 116,502 entries, each a scholarly article on philology and linguistics. Its history is fascinating, and it is preserved in the Preface to the 1940 edition, which you can read in the Perseus XML edition of the Alpha volume.

Thanks to Perseus’ foresight in openly licensing the electronic text of LSJ there are a number of excellent website and mobile apps offering access to this data.

Yet Another LSJ Dataset!

We, Christopher Blackwell and Neel Smith, have published another version of the LSJ, as a CITE Collection, with the data for each LSJ entry formatted in Markdown.

The data is on GitHub.


The CITE Architecture provides machine-actionable canonical citation for objects of study. A lexicon is, generically, an ordered collection of entries, and is thus easily citable.

History has shown that an important lexicon like the LSJ will exist in many versions. CITE lets us cite to objects in specific versions, or as notional objects, preserving scholarly identity. So an LSJ as a CITE collection is easily integrated into other scholarship.

Why Markdown?

The Markdown version of the LSJ is a transformation of the TEI-XML version. The TEI-XML version has elaborate markup, documenting many features of this complex text. There are 59,586 tagged elements in the TEI-XML version of the LSJ, with 64 distinct tags.

The Markdown version identifies only (1) a URN identifier, (2) a lemma, and (3) an entry, with English translations of Greek in bold, and a few other elements marked by italics.

Why reduce 64 elements to 5?

A Greek-English Lexicon was written, printed, and published as a text for technically proficient human readers. The markup added subsequently was an attempt to transform this 19th Century text into a 21st Century database of Greek linguistics and philology.

It cannot be done.

The abbreviations, the assumptions, the structure of the 19th Century LSJ amount to lossy compression that cannot be expanded to useful data. By publishing a Markdown edition as as CITE collection, we think we are being as true to the original lexicon as we can.

This is a canonically citable collection of data-objects that are intended for human readers.

The Data & Service

This version of the LSJ is published as a CEX file, a serialization of a CITE Collection.2

It is exposed through a CITE Microservice that accepts requests via http and returns reponses as json. For an example of a rich dataset served from CEX via a CITE microservice, see the HMT microservice, and this discussion of it.

The Web-App

To provide usefull access to this data, we offer a web-application for accessing LSJ data.

For projects that would like to link directly to specific entries, this application accepts url-parameters.

Some Notes on the Web App

You can browse entries alphabetically via the sidebar on the left of the screen.

You can type beta code without diacritical marks in the “Search Greek” field; the app will suggest words once you’ve typed two characters. The order of suggestions is as follows:

  1. An entry’s lemma is identical to what you typed.
  2. A lemma starts with what you typed.
  3. A lemma includes what you typed.
  4. Some Greek word in an entry includes what you typed.

This might help a little with the age-old problem in Greek: “You can’t look it up unless you already know what it is.”

“Search All Text” will do a tokenized search of the contents of entries. So a search for “cat” will find entries that contain the word “cat”, but will not find “catalogue” or “category”.

“Retrieve by URN” does what it says. If you enter a valid Cite2 URN identifying a passage of this version of the LSJ, it will retrieve it. Each entry’s URN appears in the upper right corner of the entry. The “Retrieve by URN” field will accept range URNs, e.g. urn:cite2:hmt:lsj.markdown:n109766-n109776, since the LSJ is an ordered collection.

If you click on an entry’s URN, it should automatically be copied to your clipboard for pasting in somewhere else.

To link to a particular entry from another web-page, attach ?urn=[some urn] to the URL for the app:

Report bugs in the app by filing issues on its GitHub repository. Report errors in the underlying data by filing issues on its GitHub repository.

  1. Lisa Cerrato, William Merrill, Elli Mylonas, and David Smith are the names attached to the origional XML files

  2. This is the first public-facing publication that takes advantage of discoverable text-property-type extensions. A property in a CITE collection is typed as number, string, Cite2Urn, CtsUrn, or boolean, basic, primitive types. Because individual properties in versioned Collections are citable by URN, CEX lets us document specific properties of type string more broadly, for example, as XML, as GeoJson, or in this case as Markdown. An application can ignore this, and render string properties as plain text, or it can discover that a property is Markdown and do something with it. This application uses the Marked.js library to format LSJ entries.