I mentioned a few weeks ago that I planned on making I's/J's and U's/V's look the same on the back-end, while preserving their traditional orthographies on the front-end. I've just completed this task!
My main motivation for this update is that certain passages stored in The Latin Library reflect the older conventions of using J's for consonantal I's or U's for both consonantal and vocalic V's. Numen's parsing engine was having trouble recognizing forms like jecit (iecit) and uiuus (vivus). After a bit of work, the engine now recognizes more possibilities than ever. Internally, J's are stored as I's and U's are stored as V's.
Another project I completed at the same time is an order-of-magnitude speed improvement for parsing. I was looking for ways to make the engine faster and discovered a shortcut that boosts speed tremendously. The engine used to spend between 250ms and 500ms parsing each word! That was always disappointing to me, but I had gotten around the problem by caching the results. Now, however, word parsing takes about 25ms!
Why bother improving the speed? Because soon I will be implementing word lists and frequency lists! A word list, of course, is just a "mini-lexicon" that defines only the words in your chosen passage, and a frequency list is a list of words in order of how often they appear in a passage. The word list will help you work through the vocabulary for a passage quickly, and the frequency list will help Latin students study more effectively by giving them the most frequent words first. I'm very excited about this feature, but I don't anticipate it will be done before January 10th (giving me the winter holiday to work on it).
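To make the two lists concrete, here is a rough Python sketch. It is not the planned implementation: the function names are hypothetical, and it counts surface forms exactly as they appear in the passage, whereas a real version would presumably parse each word back to its dictionary entry first.

```python
import re
from collections import Counter


def tokenize(passage: str) -> list:
    """Lowercase the passage and pull out runs of letters."""
    return re.findall(r"[a-z]+", passage.lower())


def word_list(passage: str) -> list:
    """Every distinct word in the passage, in alphabetical order."""
    return sorted(set(tokenize(passage)))


def frequency_list(passage: str) -> list:
    """(word, count) pairs, most frequent first."""
    return Counter(tokenize(passage)).most_common()


passage = "Veni, vidi, vici. Veni!"
words = word_list(passage)
frequencies = frequency_list(passage)
```

Here frequencies starts with ("veni", 2), since that is the only word appearing twice.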
That's all for now!
Wednesday, October 14, 2009
Thursday, October 1, 2009
I's and J's and U's and V's
So one problem with Numen is that it doesn't recognize the different possibilities when dealing with I's and J's and U's and V's. As you know, the J and the U were not Classical Latin letters. There has been a lot of back-and-forth over the past 200 years -- some editors prefer the originals and some prefer the modern versions.
But how should Numen deal with this issue? Internally, the computer is more precise and less forgiving than a human, and so in order to provide highly sensitive and accurate searches, the data needs to be "normalized". For example, I recently normalized verbs for consistency by changing all deponent verbs into their active forms and simply marking them as deponent with a data flag. Now, when you search for a deponent verb, the flashcard still shows something like sequor but internally it's stored as sequo. The reasoning here is simple: deponent verbs, regardless of their dictionary form and traditional morphology, still have active participles and their imperfect/pluperfect subjunctives are still formed from active infinitives.
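As a rough illustration of the "active form plus a flag" idea, here is a small Python sketch. The Verb class and its fields are invented for this example and are not Numen's data model; display() only handles the first-person singular dictionary form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verb:
    stored: str             # normalized, active-style form kept internally
    deponent: bool = False  # data flag marking the verb as deponent

    def display(self) -> str:
        """Dictionary form shown to the user (simplified: -o becomes -or)."""
        if self.deponent and self.stored.endswith("o"):
            return self.stored + "r"
        return self.stored


sequor = Verb("sequo", deponent=True)
amo = Verb("amo")
shown = [sequor.display(), amo.display()]  # ["sequor", "amo"]
```

The stored form stays uniform across all verbs, and the flag alone decides what the flashcard shows.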
But what about the I's and J's? Those are easy. Convert all the J's to I's, and most Latin readers won't have a problem -- this has been the convention for quite some time now. But then what about the V's and U's? Should I convert all the U's to V's? The opposite is true here: most Latinists would be mildly irritated by a form like uiuus (vivus).
The solution, which is similar to the one for the deponent problem, is to store everything internally using only I's and V's, but to show end users the contemporary I's, U's and V's. That way, the computer can do accurate searches, but users get the spellings they are used to.
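Here is a minimal Python sketch of that idea, under my own assumptions: the names normalize, LEXICON and lookup are hypothetical, and the three headwords are just sample data. Every spelling is folded to an I/V-only key for searching, while the traditional spelling is kept alongside it for display.

```python
# Fold J into I and U into V (both cases) for the internal search key.
_FOLD = str.maketrans({"j": "i", "J": "I", "u": "v", "U": "V"})


def normalize(word: str) -> str:
    """Build the internal key: J -> I, U -> V, then lowercase."""
    return word.translate(_FOLD).lower()


# Keys are normalized; values keep the spelling the reader actually sees.
LEXICON = {normalize(w): w for w in ["vivus", "iacio", "sequor"]}


def lookup(query: str):
    """Return the display form for any I/J or U/V spelling, or None."""
    return LEXICON.get(normalize(query))


same_key = normalize("uiuus") == normalize("vivus")  # True
found = lookup("uiuus")                               # "vivus"
```

The internal key for vivus comes out as vivvs, which nobody would want to read, but it never has to leave the back-end.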
So, in the coming weeks, Numen will undergo this under-the-hood transformation. For the most part, users will never even notice -- except in one area. Searching for uiuus will be the same as searching for vivus!