Unicode Conference 36: Day 2

These are live blog notes from the 2012 Unicode Conference in Santa Clara, California, 23rd October 2012.

Usual disclaimer for live blogging: These are informal notes taken by Dave Crossland at the event, and may or may not be similar to what was said by the people who spoke on these topics. Probably if something here is incorrect it is because Dave mistyped it or misunderstood, and if anyone wants corrections, they should email him immediately (dave@understandingfonts.com) – or post a comment.

Vint Cerf: The Bit Rot Dilemma

My purpose today is tangentially related to what you are about – encoding writing in digital form for the whole world. But encoding in digital form is what I work on. When you think about digitisation and representations, think about the issues of doing so over VERY LONG periods of time.

Unicode is key to the DNS; since 2003 there have been efforts to express non-Latin characters in domain names. Now that ICANN has opened up the TLD space, they had 2,000 requests for new TLDs, 1,900 unique, but few were outside ASCII. I don’t know why that is. A 40 year history?

In DNS, the expressiveness of labels is not what is at stake. It’s not about expressing every way of writing in a domain name label. We don’t want a novel in domains, we want identifiers that can be matched correctly: MATCHING is the point.

In Unicode there is more than one way of writing the same thing, and things can LOOK the same even if their codes are different. So someone can type something in and not get what they expected.
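A minimal sketch of the matching problem, using Python's standard `unicodedata` module. The helper name `same_identifier` and the sample strings are illustrative assumptions, not anything shown in the talk, and real IDN processing applies more rules than this. It shows two things: canonically equivalent spellings can be made to match by normalizing, while look-alike letters from different scripts stay different.

```python
import unicodedata

def same_identifier(a: str, b: str) -> bool:
    """Compare two labels after NFC normalization and case folding."""
    def canon(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()
    return canon(a) == canon(b)

composed = "caf\u00e9"       # "é" as a single code point
decomposed = "cafe\u0301"    # "e" followed by a combining acute accent

print(composed == decomposed)                 # raw code points differ
print(same_identifier(composed, decomposed))  # equivalent once normalized

latin = "paypal"
mixed = "p\u0430ypal"        # second letter is Cyrillic "а" (U+0430)
print(same_identifier(latin, mixed))          # look-alikes are not equivalent
```

Normalization handles the "more than one way of writing the same thing" case; the look-alike case needs a separate confusables check.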

We aren’t able to put Unicode into DNS directly, so we encode it as ASCII. Punycode, with its odd appearance – .xn--90ae – is unambiguous.
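For a feel of what that ASCII encoding looks like, here is a small sketch using the `idna` codec from Python's standard library. That codec implements the older IDNA 2003 rules, so treat it as an illustration rather than what registries run today. The label "bücher" is an assumed example; the `xn--90ae` label is the one mentioned above.

```python
# Encode a non-ASCII label to its ASCII-compatible form and back.
label = "b\u00fccher"                    # "bücher"
ascii_form = label.encode("idna")
print(ascii_form)                        # b'xn--bcher-kva'
print(ascii_form.decode("idna"))         # back to the original label

# Decode the label from the talk to see which characters it stands for.
tld = b"xn--90ae".decode("idna")
print(tld, [f"U+{ord(c):04X}" for c in tld])
```

The round trip is lossless, which is what makes the odd-looking form safe to match on.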

What you do is hard and important, so thank you!

Now, my primary topic:

Every day we create complex digital objects with applications. Spreadsheets, text documents, even more complex data structures; 3D interactive objects, interactive environments. The bag of bits can become rotten if the runtime for it isn’t available.

In 3000 AD can you open a 1997 PowerPoint file? Can you open a LibreOffice presentation file?

I had this last week: opening a 1997 PowerPoint file on a Mac failed. We invest time in making digital objects and we would like them to be preserved.

Digital photographs suffer the same thing; some contemporary photo managers no longer support GIF files. It’s not just the application code; it’s that the app runs in a particular operating system. If you don’t have the app and the OS – and then a specific piece of hardware with IO interfaces – then you may lose the ability to access your data.

Each year the National Archive gets hard disks they are meant to curate and catalog. With the variety of software on them, it’s tricky to figure it out.

Why is retaining the ability to interpret digital objects hard? Business failure; your OS vendor no longer supports some hardware, or vice versa. Products advance, and get retired, and no backwards compatibility is made. If you asked for source code, 50% will be in the new version, so they won’t release it.

You can imagine a regime that requires code escrow for published software. When I contract for software development for hire, I insist on a ‘go bust and get escrowed source code’ clause. Loss of access to digital objects is common due to business failure.

Representing the same document in various ways, ROSETTA STONE style, allows extra potential for recovery. Reverse engineering software to interpret the data is more possible there. But in many cases this isn’t possible, like a video game.

So a standardised object for a particular application, a document or a spreadsheet, which can be represented in various applications, would be a good idea.

Virtual Machines may help! You may be able to run an old OS in a VM in the cloud, allowing you to run the old applications. It’s hard. My first home computer was an Apple ][+ in 1979 and I did my accounting on that, then the ][e with a compatible 5 1/4 floppy drive. Then a Mac had the 6502 chip board to run old software and run that old drive. The MacOS could figure out how to make the old processor think it was accessing the old drive via the Mac’s new IO.

I had a hissy fit when I got a Mac laptop without an optical drive. Another thing to carry!

When there is application software you NEED and are depending on, but patents prevent you from using it, there’s something wrong with the situation. We should think about legal regimes in which we content creators have something to say about being able to retain forever the ability to access those bits.

Unicode was always about rendering writing digitally. I’d like to introduce a meme here:


Some people misunderstood this as something like a physical CD. I don’t think that. 1,000 year old manuscripts can still be read in Greek or Arabic; vellum is highly preservable. It’s physical, so if you can see, you can access it.

But I see vellum as having all the digital representations of the digital object; the data, the app, the OS. So if you can see it, you are SURE you can access it.

This means changes to copyright, patent and other laws to give us the leverage and the ability to interpret the work we create.


Domain names and URLs.

Something odd happens when you monetize things. At the start it wasn’t monetized; Jon Postel was the original registrar, funded by the US government, and he passed in 1998 just before ICANN was founded. When the NSF realised it was spending millions a year to operate DNS, and asked if it should spend research dollars on a private sector interest activity, they decided to let Network Solutions (under contract to NSF) charge for domain registrations. I thought it was reasonable; it wasn’t clear what to charge, and $50 a year for 2 years seemed minimal.

When you introduce this, and a need to renew to create a continuing revenue stream, you create instability. When you cease to pay, what happens? The domain name sinks, or returns with different content. You can be misattributed to that content.

For me it was a revelation that the monetization had side effects that are serious for preservation of content. URLs are ways to point at things that are temporary, so we lose our ability to reference information in a stable long term way. That’s unsettling.

Bob Kahn and I started a ‘Digital Object Identifiers’ project. These can never be reused; you create ‘permanent’ curated archives that contain these objects; you need a business model to support the archive, and you want to have redundant archives. It’s not just documents; it could be source code, object code, images, movies. The system should be insensitive to the nature of the content, while being able to tell what that nature is.

This is like Unicode, trying to take a digital ID, a codepoint, and have it interpretable as a specific object in a specific class of symbols. We need to classify digital objects similarly, so we know all the required dependencies for accessing an object.

When you put something in you want to be sure it’s not modified when it’s removed. A double digital signature can do this. You sign it with two keys and then it can’t be modified by a single party. Digital signatures can become weak over time so you may need to re-sign things over a long time. I have to recork the wine in my cellar in the same way.
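A rough sketch of the two-key idea, under a loud assumption: real archives would use public-key signatures, which Python's standard library does not provide, so this uses two independent HMAC keys as a stand-in. The function names, keys and record layout are all hypothetical. What it does show is that a record sealed by two parties cannot be silently re-sealed by one of them alone.

```python
import hashlib
import hmac

def seal(data: bytes, key_a: bytes, key_b: bytes) -> dict:
    """Record a digest plus one keyed tag per party (stand-in for two signatures)."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "tag_a": hmac.new(key_a, data, hashlib.sha256).hexdigest(),
        "tag_b": hmac.new(key_b, data, hashlib.sha256).hexdigest(),
    }

def verify(data: bytes, record: dict, key_a: bytes, key_b: bytes) -> bool:
    """True only if the digest and BOTH tags still match the data."""
    fresh = seal(data, key_a, key_b)
    return all(hmac.compare_digest(fresh[name], record[name]) for name in record)

archive_key = b"held-by-the-archive"       # hypothetical key material
depositor_key = b"held-by-the-depositor"

original = b"the bits we deposited"
record = seal(original, archive_key, depositor_key)

print(verify(original, record, archive_key, depositor_key))            # unchanged object
print(verify(b"the bits, edited", record, archive_key, depositor_key)) # modified object

# One party alone re-sealing modified data cannot reproduce the other party's tag.
forged = seal(b"the bits, edited", archive_key, b"a guess at the other key")
print(forged["tag_b"] == seal(b"the bits, edited", archive_key, depositor_key)["tag_b"])
```

Re-signing over time would amount to calling `seal` again with a stronger algorithm while the old record still verifies.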

I’m not sure how to advance this problem of long bit rot. I think by 2100 AD we will have lost access to everything we access today, and our colleagues in the future will not know much about what happened this year because the bits will have rotted.

I think Unicode should not be disengaged from this. I’d like you to keep this in mind, as I know you are interested in preserving digital culture.

Thank you!

Q: I’m from the W3C and I wonder whether XML can be useful for marking up data and metadata. Is there a way to reencode data in a more resilient way?

A: ASN.1 is similar. It’s not for operational purposes, running an application with a markup version, but it might be a way of documenting things, yes. PDF/A is a reasonable attempt at something similar. While readers are free, creators are not, and if Adobe goes bust (which I hope it won’t) then we may lose the ability to access that format. HTML5 might achieve similar things. This relates to my Rosetta Stone idea. I’m hesitant to leap on this because there are semantics to deal with. Sometimes that’s only achievable with executable bits. You may need the application to know why a bit is where it is and what it means.

Q: Do we really need to keep all that stuff?

A: Martin! You remind me of this young PUNK – no pejorative, it’s a compliment – a young guy talking to librarians. “The important stuff will be important and converted to new formats” – but the librarians were on the ceiling, and the book ‘Team of Rivals’ about American history is a key example of this: WE DON’T KNOW WHAT IS AND ISN’T IMPORTANT. The author wrote that book in the present tense. She went to 92 libraries to read letters exchanged by contemporaries, and reconstructed conversations based on their language. Those letters were personal, not important, just about daily events.

Q: A patch to a program can change the way a document appeared. How much accuracy is enough?

A: We’ll have a spectrum of capability. Mapping all text to Unicode, leaving that as the ‘trace’ of a document, then try for more and more formatting. For documents. But our objects are not just documents that COULD exist in print. Many things CANNOT exist in print: sound, video, and INTERACTIVE things. A spreadsheet can’t exist in print except as a momentary view of a state in time. Not the structures and formulas. There isn’t perfect fidelity possible. How to grade the levels of fidelity?

Q: Adobe Reader is likely to be around a long time (I work for Adobe) but this talk of preservation seems rather gloomy to me…

A: The Adobe founders are friends of mine and I respect what they’ve done. Flash is a different matter 😉 I’m afraid that it alone isn’t enough; we need a conscious effort to preserve peripheral data formats. PDF is great but it’s not enough because there are so many KINDS of objects.

Q: I’d like to preserve my family tree, for a tiny group of users. There are things important to 100,000s and things important to just 10.

A: What’s worth preserving is a key question, and if something is important to a family, it should be preserved by that family. This is why I think a legal regime . Paul Allen (not MS P.A.) of ancestry.com, who works at Gallup now, his digital objects are complex. … My email from 15 years ago isn’t readable; I used a PC based email application, I moved hardware, and I kept all the files, but the STRUCTURE is lost; the attachments are gone. It’s but a shadow of what I had. It’s a hazard.

Q: Is analog part of this?

A: Yes, rendering is the end part, and analog is the final part of rendering. Did you know we can look at a music record with a laser to digitize the music even though it’s damaged so much it can’t be played any more? Amazing.

Q: Richard brought up a fair point about XML and markup technologies. Semantics is a key issue. In CLDR, our data is all in XML, but we have a 100 page spec on how to interpret it. The XML helps it to be self descriptive but without the semantics, it’s opaque.

A: I was at UCLA yesterday, celebrating the Turing award. Alan Kay was speaking there. I challenged the CS community, “Where’s the science in CS?” in an op ed page. He said CS isn’t about software, programming, or even computation; it’s about PROCESSES and the way they behave. If we are going to preserve the meaning of complex objects, we must preserve the PROCESS of how to use them. So this is a fairly big challenge; I’m not sure what the right way is to go about it. If we don’t think about the Intellectual Property regimes that are friendly or unfriendly to archivists, we may lose before we begin.

Q: We need something like acid free archival quality paper. That stamp that people can gravitate toward for long term storage. That will get traction from ordinary people.

A: Hmm. Literally or metaphorically?

Q: Both

A: That acid free paper idea, it’s a metaphor for people to understand why this is important. I agree.

Q: A lot of us in the last few years have abdicated this topic to a large corporation.

A: Yes, I’m sorry to hear you’re using Hotmail 😉 It’s a reasonable concern. We allow MS, Google and Yahoo and that leaves us open to a problem. I have 75GB of email at Google. I am worried about that too, for the long term. New digital storage may be our friend; we can store that much in a small space. I wonder about file transfer and search; a terabyte in a sugar cube disk without enough IO to process it fast.

Q: We don’t know what will be important. With easily available encryption we also lose the ability to get things back. Should personal certs expire, or should there be an escrow so we can access files 50 years after someone dies?

A: Wow, yes. Protecting privacy in the near term, protecting history in the long term. I expect today’s crypto will be broken in the long future. CCs have an expiry date, so the numbers churn so they can’t be used to do bad things. If you intend for things to be available, you want somewhere to put it for the long term. The protection needs to be undo-able. Wow, I would like to work on that problem.

Q: I’m working on PDF.js, and one goal is to make it possible for HTML to render anything in PDF. What do you think about such ‘super formats’ that can encapsulate other formats?

A: The union of everything rather than the intersection of everything. But everything can be contradictory. I don’t know if it’s possible; I suspect there is a Turing Halting Problem in there somewhere. There’s no way to prove that it will. But I’m all for generality. As a former computer programmer, I’m all for trying to make the most general solution to the problem, but it’s hard so we’ll have several cases. I like this JavaScript idea though!

I met up with Tom Milo who showed me


Demonstration of ACE with SVG 🙂

New HarfBuzz Coming to a Device Near You – Behdad Esfahbod, Software Engineer, Google, Inc.



Slides at http://goo.gl/2wSRu

I spoke about Harfbuzz 2 years ago at UC34 and it’s now in a much better SHAPE haha

Old Harfbuzz vs New Harfbuzz

Who already knows about HB? 50% of room.

This project started 12 years ago when GNU+Linux started adopting Unicode. FreeType started interpreting OpenType layout tables and then removed that, Gnome’s PANGO and Qt did their own. I maintained Pango but I hated it.

Christmas 2007 I started a new shaping engine; that’s the old one. In 2009 I started on it full time, and I completed it at Google. That’s the new HB engine.

It’s a library with one function:


If you take Arabic, Thai, Indic, Mongolian, vertical Japanese… the text needs contextual rearranging.

There is a generic shaper, a fallback ‘dumb’ shaper that does nothing but stack glyphs, then backends for Uniscribe, Graphite, CoreText, and ICU LayoutEngine. You should be able to use the Harfbuzz API but if you want to use a platform shaper you can do.

It was designed from the start to be usable by humans; others are designed around font formats, exposing all font features, and are not user friendly.



Robustness is crucial for web fonts; Firefox has really helped drive this. On a desktop I could crash when out of memory. Error handling: if the font is broken, what can you do? Render something close to what you want. You can’t query for font errors; we just return the best possible output. For C programmers, we use techniques common in new free software libraries, so you don’t have to NULL check everything. Thread safety: we have a desktop that thumbnails icons, icons are SVG, SVG has text, Pango wasn’t thread safe, so we crashed. Not good. If you design Indic fonts, the Devanagari, Tamil, Sinhala shapers are all different. The new Indic shapers copy and modify the previous shapers but don’t modify them. So I had to deal with bugs, like forcing RTL on a font would crash. So I wanted a unified shaper code body.

For humans, if you have no experience with text rendering or OpenType, I can tell you, put in your face data, your font data, your text buffer, and shape! That’s it.

Flexible: Glib, ICU, UCDN. The Unicode character database is used. You can use HB with FreeType, Uniscribe or your own callbacks. There are NO dependencies.

Efficient. Instead of parsing OT tables 2 bytes at a time, I can MMAP the table directly. No malloc(), just mmap() and sanitize.
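To make the "map it, then sanitize" idea concrete, here is a small Python sketch; HarfBuzz itself does this in C, and this is not its code. It builds a tiny fake font blob (an assumption, so the example is self-contained), maps the file read-only, and reads the sfnt table directory in place, rejecting any record that points outside the mapped bytes. The function names are hypothetical.

```python
import mmap
import struct
import tempfile

def build_fake_font() -> bytes:
    """Build a tiny fake sfnt blob: 12-byte header, one table record, one table."""
    table = b"\x00\x01\x00\x00"                        # pretend table data (version 1.0)
    header = struct.pack(">IHHHH", 0x00010000, 1, 16, 0, 0)
    offset = len(header) + 16                          # table follows the single record
    record = struct.pack(">4sIII", b"GSUB", 0, offset, len(table))
    return header + record + table

def table_directory(buf) -> dict:
    """Return {tag: (offset, length)}, refusing records that run past the buffer."""
    if len(buf) < 12:
        raise ValueError("truncated header")
    _version, num_tables = struct.unpack_from(">IH", buf, 0)
    if 12 + 16 * num_tables > len(buf):
        raise ValueError("table directory runs past end of file")
    tables = {}
    for i in range(num_tables):
        tag, _checksum, offset, length = struct.unpack_from(">4sIII", buf, 12 + 16 * i)
        if offset + length > len(buf):
            raise ValueError(f"table {tag!r} runs past end of file")
        tables[tag.decode("latin-1")] = (offset, length)
    return tables

with tempfile.TemporaryFile() as f:
    f.write(build_fake_font())
    f.flush()
    # Map the file read-only; fields are read in place rather than copied out first.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        tables = table_directory(mapped)
        offset, length = tables["GSUB"]
        gsub_version = struct.unpack_from(">HH", mapped, offset)

print(tables, gsub_version)
```

The bounds checks are the "sanitize" half: once they pass, later reads can trust the offsets without re-validating each access.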


Been in Firefox since 4.0, GNOME since 2012-09

Chromium/WebKit patch landed 10 days ago

ICU LayoutEngine drop in replacement is ready

Q: This means OpenOffice, XeTeX and OpenJDK can use Harfbuzz.

A: Right, if we do that, then I can say, if you’re not Apple MS or Adobe then you’re using Harfbuzz 🙂

Android and KDE are using hb-old but we plan to switch them soon

A lot of embedded companies want to use this.

With Emscripten you can do cross compiling to Javascript!



Fallback Shaping: we can compose mark positioning from base glyphs automatically. Here is a correctly composed glyph, here is the GDEF table dropped, GPOS dropped, GSUB dropped, and all 3 dropped.


My focus for the last few months. I can write code and ask peo

OpenType is NOT a standard, it’s just RECOMMENDATIONS. Uniscribe is the standard. So we decided to be as close to Uniscribe as possible. So I got text from 60 Wikipedias, and tested each word against Uniscribe, and aimed for a 1 in a million error rate (1e-6), and it’s currently less than 0.1% (one in a thousand). So for 100,000 words we disagree on about 50 of them between Uniscribe and Harfbuzz.


There is new Myanmar shaping in Windows 8.

Q: Normalisation, its the same code point or its not. Shaping, its the same pixel in one resolution and another pixel at another res.

A: We test glyph ID and positions, not images.

Q: GID is a number easy to match. Positions?

A: We test in font design units, 2048 or 1000 em size, and I render at that many pixels. I’m not interested in hinting, I’m interested in GSUB, GPOS, and the shaping engine. We think we are outmatching Uniscribe in some parts. Our Indic shaper has a Uniscribe bug compatibility mode 🙂
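The comparison described in these answers (glyph IDs and positions in font units, word by word) can be sketched roughly as below. The `Glyph` type, the function name and the four sample words are invented for illustration; the real test harness and its data are not shown in these notes. The last two lines only redo the arithmetic quoted in the talk.

```python
from typing import NamedTuple

class Glyph(NamedTuple):
    gid: int         # glyph ID in the font
    x_advance: int   # all positions in font design units
    x_offset: int
    y_offset: int

def disagreement_rate(reference: dict, candidate: dict) -> float:
    """Fraction of words whose glyph IDs or positions differ between two shapers."""
    mismatches = sum(1 for word, glyphs in reference.items()
                     if candidate.get(word) != glyphs)
    return mismatches / len(reference)

# Made-up output of a reference shaper for four words.
reference = {
    "w1": [Glyph(12, 1024, 0, 0)],
    "w2": [Glyph(40, 980, 0, 0), Glyph(7, 0, -310, 120)],
    "w3": [Glyph(55, 1100, 0, 0)],
    "w4": [Glyph(19, 760, 0, 0)],
}
# Made-up output of a second shaper: same glyphs, one mark offset differs.
candidate = dict(reference)
candidate["w2"] = [Glyph(40, 980, 0, 0), Glyph(7, 0, -300, 120)]

rate = disagreement_rate(reference, candidate)
print(f"{rate:.2%} of sample words disagree")

reported = 50 / 100_000   # about 50 disagreements per 100,000 words
print(f"{reported:.2%} reported, against a target of {1e-6:.4%}")
```

Comparing numbers rather than rendered images sidesteps the resolution question raised in the Q&A.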

Pravin: Excellent job on the test suite. Windows fonts are working well on new Harfbuzz. We have fonts made for Pango too, so can you test the results of Pango and Harfbuzz? If we can get test cases we can fix them in the fonts.

A: Very valid point. If you look at the GNU+Linux side, we designed libre fonts against the old shaper, but now we match Uniscribe. In Sinhala, it shows up badly; there is no OT spec for it but Uniscribe implemented it. So people implemented what made sense to them in Pango. Uniscribe made some very different decisions. So now I implemented those decisions and this breaks those fonts. I deprioritised this because it wasn’t clear where bugs were in hb-old, the shaper or the font? But I’m happy to work with anyone on font issues.

Q: It seems useful to compare against Pango for sure. In ICU, we get bugs on Indic scripts in the form, here’s the parallel patch in Pango, and perhaps a commit to Harfbuzz. So the more we see convergence here the better.

A: Yes

Mark Davis: What challenges were there in plugging ICU and others?

A: It was easy, no major mismatches. One, the callback to load a table didn’t include the length of the table, so there was no way other than crashing. Also the concept of a layout engine in ICU includes a font and font size. In HB, I have a ‘shape plan’ that takes a script, language, direction and font size. So there is a mismatch in what we can cache.

Alolita: I’d love to see your unit tests, we are building something similar.

A: Good point. I considered publishing this with Wikipedia, its http://code.google.com/p/harfbuzz-testing-wikipedia and I will meet up with J Kew in Vancouver in a few weeks

Innovations in Internationalization at Google – Luke Swartz, Product Manager, Google Inc. and Mark Davis, Sr. Internationalization Architect, Google Inc.

We do core i18n, encoding work, cldr, icu heavily; we do our own segmentation, character encoding detection… and beyond that we look at entities, names, phone numbers…

We’re trying to NOT overlap with other Google presentations; Harfbuzz, ICU and JavaScript, plurals and genders talks today. Tomorrow, emoji, locale data, bidi and rtl, i18n testing, and ldml.

Other issues?

Entities. The Knowledge panel in Google search has these. If you search ‘brad pitt’ we connect that name to an entity; that’s locale independent. To put together this knowledge graph, what entities are and the relations of entities, we use Wikipedia, Freebase, and the web. A language with a huge web corpus is easier (English, Spanish) than say Slovenian. So we can translate the entities’ relationships, even though eg brad pitt and jennifer aniston have different names in Slovenia.

Names. ‘Barack “barry” obama’? Google+ Pages “Lindt Chocolate World” and google.com/+toyota

Unicode security issues. Well known: mixed scripts used for spoofing, the ‘paypal’ problem with a Cyrillic a. Mixed numbering systems: U+09EA is a Bengali 4 that looks like Latin U+0038 (“8”). We allow different scripts in your nicknames.
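A crude way to see both problems from Python's standard library, offered as a sketch only. Python has no script property, so the helper below (a hypothetical name) guesses the script from the first word of each character's Unicode name. A real check would use the Unicode script data and the confusables tables instead.

```python
import unicodedata

def script_hints(text: str) -> set:
    """Rough script guess: first word of each letter's or digit's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalnum()}

print(script_hints("paypal"))         # one script only
print(script_hints("p\u0430ypal"))    # Latin plus Cyrillic: a mixed-script label

bengali_four = "\u09ea"
print(unicodedata.name(bengali_four))    # BENGALI DIGIT FOUR
print(unicodedata.digit(bengali_four))   # its numeric value is 4, despite resembling "8"
```

A label whose hint set has more than one entry is a candidate for closer inspection, not proof of spoofing, since plenty of legitimate names mix scripts.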

You can enter 100s of combining marks and it looks like the software is damaged when they spill down the page.
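One possible mitigation, sketched under the assumption that a product is willing to cap the number of marks per base character. The limit of 4 and the function name are arbitrary choices for illustration, not a Google policy.

```python
import unicodedata

def limit_marks(text: str, max_marks: int = 4) -> str:
    """Keep at most max_marks consecutive combining marks after each base character."""
    out = []
    run = 0
    for ch in text:
        if unicodedata.category(ch).startswith("M"):   # Mn, Mc or Me
            run += 1
            if run > max_marks:
                continue                               # drop the excess mark
        else:
            run = 0
        out.append(ch)
    return "".join(out)

noisy = "a" + "\u0301" * 200 + "b"    # "a" with 200 combining acute accents
clean = limit_marks(noisy)
print(len(noisy), len(clean))
```

Any fixed cap is a trade-off: it has to stay high enough for legitimate text that stacks several marks on one base.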

Name Formatting

The reversal of given and family name is one of the biggest differences in the world.

Phone Number Formatting. This is in Android ICS

Addresses, equally complex.

Smaller languages are getting bigger, and Google products now aim to support 60 languages. Google previously only supported Hindi in India.

Google Translate is also expanding from 40 to 65; the last one was Lao.

They added Yiddish, and because they could do English to Hebrew and German, they could convert that to Yiddish, and English can now be the conduit between eg Lao and Yiddish.

So now you can talk to your room mate in NYC 😉

Google Speech recognition covers 42 languages with 46 accents/dialects

We’re also working on Fonts in our group.

Encoding is Unicode, but we’re working on a pan-Unicode font project, Noto.


We see these boxes where no font has the glyph, ‘tofu’, and we want to eliminate this, so “Noto” for ‘No Tofu’.

It will be libre licensed so no user will ever need to endure tofu. They are harmonised too, they should flow nicely, not a ransom note.

we will also have UI versions with different virtual metrics.

also important with fonts is knowing what is in a font; there are specialised proprietary tools for analysing fonts, it was hard to do with libre software.


we use this to read font data, it was black magic that only font developers knew, but now everyone can do it with this library


http://translate.google.com/toolkit is used for ALL google’s own localisation. this hooks into machine translation. ARB is a format for web app l10n


If you own a video on youtube, you can add captions and translate them in place with GTT.

Arabic Typography – Tom Milo

Here is a ‘minimal pair’ – 2 lines with 1 difference, ‘as money for a king’ and ‘the money of the king’ – lam-alef fusion and lam-alef ligature.

The connection jumps in logical space. The top one is a ‘fusion’ not a ligature; that’s the bottom one. In Arabic you must distinguish this; letters in the same block that fuse are fusions, and letters in different blocks that touch are ligatures.

So some results: here is the lam alef, anything with a lam on the top row and anything with alef on the bottom. The cells should show a fusion including all the elements described by the unicode. IranNastaleeq should do this for all 42 cells but it only does FOUR. That font has over 4,000 glyphs. These 4 are sufficient for arabic and persian.

Nafees Nastaleeq, a government project. 1,001 glyphs and only 3 successful fusions. This is enough for Urdu, not even arabic.

Alvi Nastaleeq, a Monotype Nastaleeq design, and it has only ONE fusion that works. 20,000 glyphs!

Jameel Noori Nastaleeq is the same typeface but reprogrammed by another team, with 4 fusions; Persian and Arabic, not Urdu.

Finally, DecoType Nastaleeq has 456 glyphs and covers all 42! There is another col with a new unicode point we are yet to support.

This table shows similar results for a word; only a handful of successful fusions for all these OpenType fonts.

Now, I want to take you through an arabic typesetting project. For printing to come to Europe from China or Korea, it must have come through the Arabic world.

The script is found everywhere Islam played a role. Arabic is an Islam related script; Ukrainian Muslims of Tatar descent and Serbo-Croats who were Muslims used Arabic. Javanese is written with Arabic.

the mechanisation of script associated with islam? arabic typography is a by product of latin typography. did europe invent typography or renovate it? 🙂

There is always the argument that typesetting Arabic is complex with many contextual variations; but Latin type at the very start had just as many contextual variants; 22 alphabetical letters with 323 sorts in the Gutenberg bible. They used them to appear less mechanical. So there was no mechanical hurdle.

The idea of designing type came later; initially it was an accurate model of the hand written script. Here are 2 lines from an Ottoman Koran, and here are 2 lines based on digital type. The first is DecoType Naskh and the second is Linotype Lateef; typical of ‘eurobic’ fonts made by people who don’t understand the script or who don’t have the technical implementation to work with.

Allographs: the unit of graphic rendering, like allophones in phonetics. Arabic typography revolves around letter blocks; single letters or fused letters. Ligatures are letter blocks touching. Other ligatures are European approximations of letter fusions.

Archigrapheme – the unit of Arabic script analysis. The only way to read this word is to already know what it is. If the dot is placed above it works, but if the alef is under, it can be pronounced two other ways. So no dots means 3 meanings.

Typography is deep down a reproduction of writing. arabic, you look at paleographic text, proto-typographic writing chosen by early typographers to reproduce.

When you deal with real arabic, you need to identify the style.

There are theographs: words for god.

De-grammaticized Arabic was done by workers in the Vatican who used Syriac script grammar to shape Arabic. ‘yod eurobic’ has this inverted Y and ‘noon eurobic’ has an N form. These forms don’t exist in Arabic.

Practically this means you can see Arabic has a DISSIMILATION of repeated letters; if you aramaize Arabic you see repeated shapes.

Each script style has a different solution, each is a system with grammar. Eurobic collected fusions in a haphazard way and added them to eurobic typography. OpenType is based on this typographic approach.

This table of green, red and blue cols helps you to identify the style.

traditional typography for print was driven by form, eg, LM-LA, two glyphs. but in digital type you want to color each letter individually, or have a cursor step through each letter. the only way to get each letter lined up is to design the tech not from form but from content. that is the approach of Decotype.

Here are 3 letters, keshide – you can add a keshide in either of the 2 interletter positions.

This image shows that there is a euro approach which has nothing to do with the arabic script tradition.

Finally, there are 2 [BIG] books typeset with decotype.

behdad: why

A: the majority of my examples are mechanical; there is mech euro and arabic.

behdad: the syriac conspiracy theory, the yod/noon variants. if you want to cut letters, you pick one and cut it.

A: you are now used to it, it’s all over the place, but it’s a historical development I am presenting. there was an effort to develop Arabic script; in India the British tried. the Persians went to lithography, the only way to get the script right because movable type didn’t work. I said at the start that typography is originally meant to represent the writing as it was in use. that’s why it failed in the Arabic world when it started but took off in Europe. it’s the mass production of newspapers with Linotype that used eurobic that forced it down the Arabic world’s throat. in the last 20 years, with macOS and others, the software vendors used eurobic, and the style became associated with modernity. so now there is a trend to use computer technology for Arabic instead of fitting Arabic into computer technology.

Q: nastaleeq?

A: Europeans never got close to mechanising it and the people there have rejected anything that doesn’t get close.

– – –

Polyglots in the mist

What does it mean to say you speak 24 or 36 languages?

it’s a trade book but a serious book: ‘Babel no more’ – the search for the world’s most polylingual people

It was reviewed nicely in the Times, Economist, The Atlantic, The Times Literary Supplement, Vocabulary.com said I was ‘the indiana jones of polyglots’ – but it was evaluating claims people make about themselves about this.

hyperpolyglots are a mystery in plain view; linguists just didnt look into this. like the loch ness monster has an office in the biology dept and no one had gone to see them 🙂 This is a kraken, another cryptid.

Traditionally, a polyglot is a person who knows 6 or more languages. that’s from the UCL professor Dick Hudson who surveyed multilingualism, and there are communities where EVERYONE spoke 5 languages, so someone with SIX must really be extraordinary.

I revised this upwards a bit to 11 languages 🙂

These were mysteries for science; although there are many stories of latin high school teachers who can speak a dozen languages, no systematic or empirical evaluation. so i went looking for them.

This guy: Cardinal Giuseppe XXXX (1774-1849) who didn’t travel outside northern Italy but he lived in a time when many Europeans were there too who he learned from. as a cardinal he worked as a librarian and in the Propaganda Fide (?) the evangelical wing of the Vatican. that place at that time was a unique place to see the world’s languages. it is said by his nephew that he spoke 114 languages and dialects. also 72, a religiously significant number as that’s the number at the time of the fall of the tower of Babel.

Some scholars said he grasped 60-61 but agreed he had MASTERED 30. he left an archive in Bologna. he was a librarian there, one of the world’s oldest libraries; I went to the street he was born on.

this is a box i found in his archives, not described in any of the descriptions of who he was or what he could do.

he was the 19thC premier language learner. what enabled him to do this? people claimed to know the secret but never told it. Maybe it was this box. it was labeled ‘MISC’ 🙂 he had packets of cards, 1 1/2 x 3 inch slips of paper. something in Italian or Latin on one side; on the other side, Persian or something.


the mythology of polyglots is that they just absorb it. there isn’t hard, time intensive repetitive work involved. but this is a hint that’s not true. I don’t know the history of the flash card as educational technology, but I doubt he INVENTED the flash card.

I also wanted to go to a non European place where multilingualism was common and the idea of knowing languages was different. I went to south India, Bangalore and Hyderabad, to glance at what’s happening there linguistically. even there, where it is very common to speak 5+ languages from a young age, people who can speak 12+ languages are revered and talked about.

I told people about my project and they said ‘wow yes you must meet someone i know of’. i have studied spanish and mandarin and it changed my brain. this is an indian diplomat who knew indian and chinese and russian languages.

before indian i went to mexico, the Hippo Family Club, a japanese language learning club, 35,000 members, groups in korean, mexico, and us. a fmaily orientated play/game space with a motto that EVERYONE can speak 7 languages. they do this through language kareoke. this is for middle middle class who realised their kids had english but they needed MORE than enlgish and spanish for better economic chances.

I also spent time with Alexander Argreyus (?), who moved to Singapore and works at U of Omalan (?), and who has a daily linguistic workout. This is in my book trailer.

“I wake up and write 2 pages of Arabic, and spend a couple of hours reading Arabic and listening to Arabic and reading bilingual texts and reading freely” – 9 hours a day with a wife and 2 sons; it was 14 hours a day before. Then I spend 2 hours on other languages in 20 minute chunks. He has a PhD from U of Chicago.

Emil Krebs is another great hyperpolyglot from history.

You see how textually focused he is. He studied many languages simultaneously. There are a few things specific to them: they learned how they learn, and they build their own pedagogical environments to study in the way they want to study. They learn in a NON-INSTITUTIONAL way, which means they have freedom to develop their own programs for themselves.

Alexander wouldn’t tell me how many languages he speaks, or agree to be tested in any way. I have to take his word on his abilities.

What are the upper limits? (on ability to learn, speak, and use languages?)

The lifetime limit is 50-60. Oral proficiency in 22 languages was demonstrated to me by a Belgian and a Scotsman. They competed in polyglot contests in 1987 and 1990 to find the most multilingual Belgian and then European. In the contest, you had to speak a minimum number to enter, then they told how many languages to be tested in, in 10 minute conversations with native speakers, with 5 minute breaks.

The winners were given points in 22 languages.

I did an online survey of people who said they knew 6 or more languages; 26 was the highest reported among the 400 people who responded.

When you meet someone who says more (someone on the survey said over 500…) you know it’s not for real.

People can keep 5-9 languages no matter their IQ, cognitive skills, etc. Something about the HUMAN BRAIN means it can hold that many languages without any extra effort. Above that, up to 22, there is a patchwork of proficiency.

This is a bar chart of one person’s mix. It shows reading is less mentally taxing than speaking.

Here is a line graph of total, easy, and bilingual. Easy means they said it was easier for them than for other people, and bilingual is those who were born into language learning. You see it peaks at 6 and quickly tails off.

172 who knew more than 6, and 289 people who said learning languages seemed easy for them. The profiles are quite similar. Overwhelmingly MALE: 69% and 65%. That might be the maleness of the online population, or the maleness of the social networks advertising the survey. Historically it’s male dominated, which perhaps reflects historical patriarchy.

I asked people in both groups why they thought they were unusual; 50% said it was innate talent. 20% said they were smart. 45% said they worked hard. I would have thought that one would be higher. [more stats]

11+ languages: they were mostly European, Roman script, national languages. Non-Latin? Farsi, Japanese, Mandarin, Cantonese, Korean, Thai, Hindi. Averaged 9 languages over 5 scripts.

How has Unicode impacted this community? It’s HUGE! OpenLanguage, Anki, Omniglot, ChinesePod.

Can anyone be one of these people?


They have an atypical neurology. They have the repetitive activities needed to put languages in their head, like training on flash cards. If there were fewer of them historically, one reason is that learning a language takes you out of your community. A hyperpolyglot is less social by definition, ironic given language is for socialising. But online, people can be together, in a tribal way, around their learning techniques and materials.

Unicode is key to this.

They like these tools to be free 🙂

They represent multilinguals; post-monolingualism is only 20% a joke. Their need for materials is as high as a native speaker’s, even if they can’t fully read it all yet.

Unicode is key to building a tech and social environment to reinforce their neurological predilections. So I think we’ll see more emerging. Maybe I see it since I went looking, but I think it’s a real trend, not just confirmation bias.

Q: I have 7-10 languages conversationally and 14 I studied. But I forgot a lot of them. How do you count how many people know?

A: I left that to people themselves. I wanted to deinstitutionalise it. ‘I speak Kannada!’ ‘Did you take a class? A test?’ ‘No, I just picked it up’ – so if you think you learned it and it’s useful in your life, you learned it, I think. So you’re free to say as many as you want 🙂

Q: I have an idea about the gender gap. Men have more free time than women! 9 hours a day with 2 kids???? [applause] And where are they on the autism spectrum? It’s like a highly competitive sport. Did you speak to them about their motivation?

A: Yes, and I went into neuroscience. The gesein-alliberta (?) hypothesis is an endocrinology theory of handedness and maleness, dyslexia, asthma, etc. There are sensitive periods in fetal development that change people, tending towards left-handed males who are good verbally and not spatially. Alexander doesn’t drive a car! It’s a telltale. But it’s a hugely controversial theory, and you can’t amass enough data on it. But yes, the gender thing is interesting. I joke that the next book will be ‘The Wives of Polyglots’ 🙂

Jungshik Q: He already has a PhD, what else does he do?

A: He was a professor in Lebanon; then he was unemployed, doing freelance translation and living from investments. Then he moved to Singapore for a job.

Jungshik Q: any brain scans?

A: No, people who have 4 languages in Switzerland did this though. A hyperpolyglot who spoke 68 languages was studied too. I’d like them to donate their brains but they’re using them 🙂

Q: I have lots of friends who don’t drive, it’s normal 🙂 I also see English as a primary language as a bias in the survey.

Q: Where can we see the survey and data?

A: I’d like to put them on the web; the results are described in the book. But it’s a survey, not a census.

Q: Would be great! At Wikipedia we have many bi- or trilingual speakers.

A: It was up for a year 🙂

Q: You mentioned languages in the same family; a language is a dialect with an army 😉 Swedish and Norwegian, there isn’t a clear-cut distinction. It’s like saying I speak US English and Scottish. Someone who speaks a few disparate languages is more impressive than someone with many similar languages.

A: Right, in the 1990 competition they were national languages, no dead ones, no invented ones, and all oral tests. For me, if people CLAIMED it was a different language, I let them have it. When Alexander says he won’t tell me how many languages he speaks, there is meaning there: what does a language weigh? How do you measure diversity? Do people who speak 6 languages for basic conversation without reading them have more complexity than a monolingual educated person with a very high vocab and range of dialects? The numbers are not great data points.

Q: Is there a universal grammar?

A: There is someone, Christopher, who has a single grammar, English, and he ‘calques’ all other languages through that grammar. In the 2 contests, judges were told to disrupt the contestant when they started, in the 2nd one, so they couldn’t talk about a practiced topic like golf; they switched topics to archeology.

Q: Calques?

A: Most languages have a wide choice in grammars, but in a foreign language you can learn just a few and rely on them. Or Greek people speaking English will use their root words that are valid English more than French or German words that are synonymous.

– – –

Bringing Multilingual PDFs to The Open Web: Bringing PDFs to the web, a review of PDF.js – Adil Allawi

I started this work in 1982 🙂

This project is one of the coolest out there today. They had an idea to do something REAL with JavaScript: rendering PDF in JS 🙂

It took on a life of its own and is now the default PDF renderer for Firefox and used by Google in their performance testing suite.

This is the Sumerian king kneeling before a tablet with writing. Writing is key. As Vint Cerf’s talk mentioned, PDF is key to preserving documents.

Adobe’s PostScript was an interpreted language for printing in the 80s. Taking that and making a document that opens everywhere took off in the 90s. I lived and breathed PostScript but it’s not commonly known these days; I coded on PS interpreters in college, it was great. You wrote a program that made a page, and you could let the printer’s internal computer do the complicated layout processing. That made Adobe; it took over the printing world.

Dr Warnock was the co-founder and CEO of Adobe, who wrote the “Camelot Paper” in 1991, about the future of this. They saw people sending faxes; faxes were digitisations for sending documents across the phone network. PostScript being an interpreted language meant you can’t redisplay a document without rerunning the document program, which may lead to a different result. PDF was the result, meant to replace fax by sending documents across computer networks, and it too then took over the printing world.

What is a PDF?

It’s a text format: you have an ID number, object data, etc. It was FLATTENED PostScript, supporting all font formats used in 1991, and it could FALL BACK for Latin text using ‘multiple master’ fonts. Adobe Reader had an MM font that would stretch to match the missing font.
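To make the "ID number, object data" idea concrete, here is a minimal sketch in Python (not from the talk) that lists the numbered objects in a tiny hand-written PDF fragment. The sample bytes, the regular expression and the `list_objects` helper are all illustrative assumptions. Real PDFs also have cross-reference tables, binary streams and compressed object streams, which this ignores.

```python
import re

# A tiny hand-written fragment in PDF syntax (illustrative only, not a
# complete valid file: there is no page content, xref table or trailer).
SAMPLE = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
"""

# "<object number> <generation number> obj ... endobj"
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\s*(.*?)\s*endobj", re.DOTALL)


def list_objects(data: bytes) -> dict[tuple[int, int], bytes]:
    """Map (object number, generation) to the raw bytes of each object body."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3)
        for m in OBJ_RE.finditer(data)
    }


objects = list_objects(SAMPLE)
for (number, generation), body in sorted(objects.items()):
    print(number, generation, body.decode("ascii"))
```

The `2 0 R` inside object 1 is a reference to object 2 by its ID, which is how the objects link together into a document.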

ISO then standardised it and features crept in: JS engine, forms, video, Flash, 3D, and STRUCTURED elements.

PDF.js shows this whole PDF structure can be put into a browser, 100%. You give the browser a PDF and the browser renders it. pdf.js is the user API; it depends on stream.js to deal with objects, evaluator.js, spec.js, canvas.js, etc.


PDF documents are EVERYWHERE. There are billions of them. It’s important to see if web technology can match the rendering of PDF, and the implementation effort for pdf.js has helped improve JS engines. Since JS runs everywhere it helps prevent bitrot. Also security: external PDF readers expand the trusted codebase. Adobe Reader is a large plugin, and it can be hacked.

Q: So it translates a PDF to JS and renders that?

A: PDF is just a dictionary of objects. …

Q: When I see a PDF I feel I can rely on seeing what the author intended. When I think of JS, I don’t feel sure I see what the author intended.

A: Right, it’s just an interpreted language. But PDF can be rendered differently by different apps. ISO standardisation is meant to reduce this. Microsoft’s OOXML doesn’t guarantee matching rendering across implementations; PDF is meant to.


You can see the text layout chunking in a background process.

There is a ‘viewer.html#pdfbug=all’ name you can add to the URL to see a self debugger.

We see this in not HTML5.

Q: Does this mean text is not selectable?

A: It is selectable.

SVG can’t handle glyph drawing or a stream of glyphs. SVG requires a DOM; if you add too many DOM objects it becomes too heavy. 10,000s of objects, each with their own payload.

Unicode in PDF?

Pre-Unicode fonts, 8 bit simple fonts, composite fonts, OpenType (CFF and TTF). Fonts have glyph names.

Arabic and Bidi: detect Arabic lines, reverse ligatures, and reapply the bidi algorithm. This will do the right thing and allow copy and paste to work.
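A rough sketch of those steps in Python (not pdf.js code, and not from the talk): it assumes the PDF gives us a purely right-to-left line in visual order using Arabic presentation forms. The helper names `is_rtl_line` and `visual_to_logical` are made up for this example. Plain string reversal stands in for the real Unicode bidi algorithm, which also has to cope with digits and mixed-direction runs.

```python
import unicodedata


def is_rtl_line(line: str) -> bool:
    """True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(c) in ("R", "AL") for c in line)


def visual_to_logical(line: str) -> str:
    """Turn a purely RTL visual-order line into logical-order base letters.

    Reverse first, then let NFKC expand presentation forms and ligatures;
    normalising first would leave the letters of each ligature swapped.
    """
    if not is_rtl_line(line):
        return line
    return unicodedata.normalize("NFKC", line[::-1])


# The word "salaam" as drawn left to right on the page:
# MEEM isolated form, LAM-ALEF ligature final form, SEEN initial form.
visual = "\uFEE1\uFEFC\uFEB3"
logical = visual_to_logical(visual)
print([unicodedata.name(c) for c in logical])
```

After this step the string holds ordinary Arabic letters in typing order, which is what copy and paste needs.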

What isn’t solved? It’s hard to recognise columns and paragraphs.

Some Indic languages use 2+ glyphs for a single char.
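One illustration of the Indic point, as a small Python sketch (my example, not one given in the talk): the Tamil vowel sign O is a single character whose canonical decomposition is two parts, and these are typically drawn as two glyphs, one on each side of the consonant. A text extractor that only sees glyphs has to map them back to one character. The `canonical_parts` helper is a made-up name, and how pdf.js deals with this was not covered.

```python
import unicodedata


def canonical_parts(ch: str) -> list[str]:
    """Names of the code points in the canonical decomposition (NFD) of ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]


# U+0BCA TAMIL VOWEL SIGN O: one character, two parts.
parts = canonical_parts("\u0BCA")
print(parts)

# Going the other way: once the two parts are in logical order after the
# consonant KA (U+0B95), NFC recombines them into the single vowel sign.
recombined = unicodedata.normalize("NFC", "\u0B95\u0BC6\u0BBE")
print(recombined == "\u0B95\u0BCA")
```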

What is needed?

PRINTING. We need an API for printing from the CANVAS tag.
