Rusty, abandoned bytes.
abandoned bytes

Speeding up the sampling phrasetable in moses

The project: ModernMT

I am currently on a 6-week codevelopment workshop in the picturesque town of Trento, Italy. The project with the uninspiring, but aspiring title of “ModernMT” aims to make current machine translation technology easy to use and scale. The targeted simplicity implied some design choices that are supposed to quicken the setup and trial of an SMT (Statistical Machine Translation) system: for example, a quick and simple word aligner used in training, and a suffix array index of the bilingual training corpus instead of a static, pre-computed phrase table as the dictionary.

To boost translation quality, we are furthermore experimenting with domain adaptation. The goal here is to have a good general-purpose translation engine that is excellent (beats Google Translate in translation quality) for specialized domains of text, like patents or software documentation. The ModernMT software identifies the type of input text and assigns domain weights to it, which indicate the input’s similarity to the training domains.

The problem: sampling phrasetable runtime performance

One of my recent tasks was to investigate why the current ModernMT prototype was unbearably slow in an interactive translation scenario, where users feed the SMT system with individual sentences to be translated. On the old prototype, the translation of a single sentence could easily take a minute! Looking at simple timing information collected by the moses decoder, the time spent in the sampling phrasetable (“collecting options” as reported by moses) seems unreasonably high, occasionally taking tens of seconds for only a few words to be translated.

The time used to extract phrase pairs from a constant number of samples should be independent of the words queried - unless the queried source phrase occurs less often than the number of samples, in which case sampling should take less time. On a large training corpus of 1.3 billion words and using 1000 samples, querying a manually selected variety of word types with vastly different frequencies yields this plot:

Sampling time for words with different frequencies

(The two sampling variants ranked and random differ in the strategy used to choose samples if there are more corpus occurrences than samples to be taken).

We can clearly see that the most frequent word tokens like the, ,, a etc. seem to take linear time to query, instead of constant time, and this time becomes unreasonable above about one million occurrences of any word. This design obviously fails to scale up with the corpus size for the given use case, since many word types fall into this “slow” category (circled above) with a large corpus.

To explain the problem, it is now necessary to introduce some used concepts from machine translation. If you are only interested in the performance problem of the actual implementation, you can move on to the section “Implementation details” below.

Sampling phrasetable details

In the project setup, phrase pairs are extracted on-the-fly from the indexed bitext corpus and the word alignment as needed by the decoder for a specific input sentence to be translated. For each possible subspan (source phrase) of the input sentence, the sampling phrasetable looks up the occurrences of that phrase in the source side of the training bitext corpus. Then, phrase pairs are extracted from those word-aligned training occurrences to provide translation options. To avoid unnecessary computation for frequent phrases, a specific constant number of occurrences is considered, drawn randomly (samples parameter).

Domain adapatation: ranked sampling using the sampling phrasetable

Collecting and pooling training data of different machine translation customers can provide much better coverage, leading to the availability of longer phrases and therefore better translations. However, there is a problem: adding large numbers of unrelated translations of the same source words may make translation worse for other customers.

Our current approach to avoid this problem changes the phrase sampling strategy as follows: given the identity of a translation customer and some existing training bitext from that customer, we first sample from that same customer’s training data, attempting to gather the amount of samples from the relevant training data. Only if that data is too sparse to get enough samples does the algorithm move on to sample from others’ training data. This way, the sampling phrasetable has a built-in preference for the correct, relevant translations.

Implementation details: the performance problem

Having explained the goal of the algorithm, we can move on to investigate the actual implementation. There is one index of the entire training corpus. So how to determine which customer’s sub-corpora to sample from first? Correct – we need to look up the occurrences in the index and then loop over all occurrences to find the ones belonging to the customer. To ensure the correct uniform distribution of the sampling, we need all the occurrences in the customer’s training data, even if there are orders of magnitude more occurrences than we need to sample. Aha! We have found the source of the linear scaling behavior. But how to implement this faster?

Split indexes

The sampling phrasetable needs to sample from individual customers’ training data. So we need a data structure that can answer the question for a constant number of random occurrences of a source phrase for a given customer in time constant with respect to the number of phrase occurrences. This is possible if we build one index per customer: that way, all the occurrences for the given customer will be neatly sorted one after the other in the customer’s index. Then, we only need to draw samples indices from that range, and only have to sample those indices. It does not matter anymore if the corpus has 1,000 or 10,000,000 occurrences of the source phrase, the sampling will take constant time.

I implemented this split index during the co-development workshop in Trento. My implementation is called ranked3 (as opposed to the original ranked algorithm, which already had an experimental version 2). Here are the results in terms of the same plot as above. Note how sampling time in the relevant part of the plot is now independent of the word frequency:

Reduced to constant sampling time

My implementation improves the scenario where users interactively request translation of individual sentences. In this scenario, we cannot cache the phrase pairs (this is because the actual sampling uses context information to intelligently continue sampling using other customer’s training data with similar text, so we couldn’t, for example, use one cache for each customer). On average, sampling is 8 times faster and final translation times are now 5 times faster:

Speedup after implementing ranked3 sampling

Changing Linux keyboard layout in GUI and CLI

I am proud owner of a Thinkpad T520. It has many design choices that just make it work for me. However, it has a typing ergonomy problem: the large distance of the Home and End keys from the arrow keys on the keyboard. Writing a lot of text and code, I heavily rely on all these keys to navigate, and I find having to move my hand a lot between these keys disturbing. Thankfully, this issue is easy to fix through remapping the Internet navigation buttons placed right above the arrow keys.

Thinkpad T520 Back and Forward buttons

Keycodes for GUI

To find out the keycode for remapping an arbitrary key in the GUI, start xev in a command line, then press the key to be remapped and note its keycode in the output. For my Back button, this is 166:

KeyPress event, serial 40, synthetic NO, window 0x5e00001,
    root 0xbe, subw 0x0, time 2410131, (-57,523), root:(529,552),
    state 0x0, keycode 166 (keysym 0xff50, Home), same_screen YES,
    XKeysymToKeycode returns keycode: 110
    XLookupString gives 0 bytes: 
    XmbLookupString gives 0 bytes: 
    XFilterEvent returns: False

Remapping for GUI

Edit file /usr/share/X11/xkb/symbols/gb (where gb depends on the local keyboard layout you actually use), and add these in a convenient location (I chose inside xkb_symbols "basic"):

// David: remap Back and Forward buttons
key <I166>   {      [ Home ]       };
key <I167>   {      [ End  ]       };

To test the effects immediately, use setxkbmap gb.

Keycodes for CLI

Use showkey --scancodes to obtain the scancode. showkey exits if no key is pressed for 10 seconds. For my Back button, the scancode is 158.

Remapping for CLI

Edit file /usr/share/kbd/keymaps/i386/qwerty/ (unzip, edit, and re-zip):

# David: remap Back and Forward buttons in CLI
keycode 158 = Home
keycode 159 = End

To test the effects immediately, use loadkeys uk. Again, uk is specific to my layout, and you need to know which one you are actually using.

Quickstart: Scala, SBT, Wisp for scientific computing on Linux

Install Scala and SBT

Install Scala and Simple Build Tool (SBT) as customary for your Linux distribution.

  • Debian: install instructions for Scala (Debian has old version packaged), and SBT
  • ArchLinux: sudo pacman -S sbt scala

Set up a barebone SBT project

Create an empty project (arbitrarily called project1) with the correct dependency (Wisp). Wisp is Scala Plotting.

The file build.sbt specifies settings for SBT. Create a new project directory (here, ~/project1). Then create ~/project1/build.sbt:

lazy val root = (project in file(".")).
    name := "project1",
    version := "1.0",
    scalaVersion := "2.11.7",
    libraryDependencies += "com.quantifind" %% "wisp" % "0.0.1"

Start up a Scala REPL (interactive console) and be patient while SBT downloads libraries for Scala and Wisp. SBT will eventually drop you into a Scala REPL with the classpath containing all dependencies:

$ cd ~/project1
$ sbt console
scala> import com.quantifind.charts.Highcharts._

But it does not seem to work. Jetty serves up a 404 page. The folder served from is empty (does not contain the index-* file). But we can download Wisp sources and start SBL for that project.

What is scratching above me?


Multiple mice have been found in our attic before, so I know the noise. The usual solution is a standard mousetrap, which kills our “guests” – not very friendly hospitality…

How about something new?

Here is an automatic mousetrap I built yesterday. The main component is the actuator: A solenoid found in one of Vienna University of Technology’s trash containers. It is triggered by an infrared light barrier.

I caught the first mouse after just 2 hours!

SpamAssassin with Alzheimer's disease?

I love SpamAssassin - it saves me from having to delete around 100 spam e-mails per day which appear in my inbox (it’s been almost a decade that my old e-mail address is active, and it probably appears in all major e-mail address packages that spammers buy). However, as it’s the third time I need to empty and re-start the Bayes database from scratch, it seems that the autolearn feature has yet to learn to forget.

It appears that after some while (around 6 months), the autolearn starts to bias the Bayes database towards treating spam as ham, up until the point that wristwatches and elaborate male organ enhancers with widespread names start to have BAYES_50 score.

The only help seems to be clearing the database:

sa-learn --clear

Obviously, the Bayesian filter needs to be re-trained afterwards, which can either happen automatically based on the other filter mechanisms in SpamAssassin (may take a while), or manually (which is a PITA).

I guess that there is some need to forget continuously, so that new training stuff can replace old, errorneous (self) training records.

I’d appreciate any comments on this issue.

Yaca: CAN signal, ground problems?

bad signal on CANL

Blue: CANH, yellow: CANL - this should be a differential signal.

I can’t connect nodes in the workshop. The problem is old, but I always thought it had to do with the connector there, but now that I’ve checked - no wonder, I would be suprised to see it work like this. On the right side is an oscilloscope picture of the CAN bus lines.

I have figured out by now that this difference in the appearance of the two signals has at least partially something to do with the ground difference. As all nodes are supplied centrally, the ground difference builds up along the wire, and gets up to around 1.2 V difference at the furthermost node.

I’ve read a bit about CAN ground, and it seems like many experienced people are suggesting that there is not only no need for exactly the same ground potential, but that there is no need for a ground line at all. I, however, can’t really imagine what the MCP2551 is going to do to pull a bus line, which is already below its ground potential, further down.

Hence, I’m trying to make use of currently unused wires: From the workshop to the laundry, the cable is a main bus line without a CAN pair going back, and there is no 5 V supply, thus there are 4 additional wires which can be used to strengthen the ground potential in the laundry. The cable to the computers doesn’t have a CAN pair back either (and only the 5 V ground is used), so there are 3 additional wires to transport ground potential.

Altough the signal has now improved a bit (and doesn’t jump around on the oscilloscope screen as much), there are still error frames (sadly I didn’t capture the oscilloscope screen for these yet) now and then, and the communication to the workshop still doesn’t work, altough the rest still does talk.

I will have to read up more on how to track down errors. The cable to the pool could also be worthwile for being “modded” with additional ground lines (5 V pair unused to there). Oh, and I’m sure the TV distribution (or the nodes there) would be happy about some huge elcap too.

I can’t mention how glad I am that this is not all happening at a time where much more things are automated - it’s currently a rather convenient field test as I’m able to shut down the whole bus for measurements without having to sit in the dark, being locked out etc.

Edit 2012-11-25: The problem turned out to be in the distribution box in the laundry, which connects different parts of the bus together. There was a loose contact in one of the connectors, which split one of the differential CAN signal lines into two parts. Interestingly, CAN is rugged enough to work most of the time with only one of the two signal lines.

Brave New World

People still went on talking about truth and beauty as though they were the sovereign goods. Right up to the time of the Nine Years’ War. That made them change their tune all right. What’s the point of truth or beauty or knowledge when the anthrax bombs are popping all around you? That was when science first began to be controlled - after the Nine Years’ War. People were ready to have even their appetites controlled then. Anything for a quiet life. We’ve gone on controlling ever since. It hasn’t been very good for truth, of course. But it’s been very good for happiness.

— Aldous Huxley - Brave New World

Image downscaling algorithm

Lately, my father needed some image scaling code for an application (yep, he is a software engineer). You would suppose that there are many libraries for scaling images, as there are a whole bunch of algorithms like bilinear, bicubic etc. Lots of googling, however, leads to another impression: There are quite a few libraries, but with enormous dependencies themselves, one even requiring the whole GTK+ library!

The project my father is working on is being written in Borland C++ Builder (yuck!). The Borland C++ toolkit itself doesn’t support anything other than scaling with the nearest neighbor algorithm, which looks very ugly (pixely). Tired of googling, I wrote a small image scaling code which doesn’t depend on anything else (than a standard C++ compiler). I’ve made my own considerations on how to scale down an image, but the nearest thing to it is the bilinear interpolation algorithm (I really don’t like the mathematical approach on Wikipedia by the way. I’m a more a programmer than a mathematician!)

The downscaling is done, self-explainingly, by the scale() function. The code is an optimized version including macros, which unfortunately is not very readable. I suggest you just use it ;-) You could also bug me to make a nicer version.

readpng() and the PNG library are only needed to provide some sample image input, which you should replace by some other form of getting pixels. Update: A version with BMP input is now available in the ZIP archive.

At the moment, the code only supports downscaling. It seems that a few % upscaling is no problem though.

Update: Seems like you need the following stuff starting from libpng 1.4:

#define png_infopp_NULL (png_infopp)NULL
#define png_voidp_NULL (png_voidp)NULL

Update 2: Some compressed PNGs were not extracted with the proper pixel data - this bug is now fixed.


sftp annoyance - or how to log in to an sftp server with password from a script

The default sftp program Linux (at least Ubuntu) uses is written with security in mind, instead of the user’s wishes. Stating that “writing your password into a script is insecure” (which it obviously is, I fully agree!), they restrict the password input to the terminal, and somehow the smart-ass programmers managed to find the terminal even when all standard I/O (stdin, stdout, stderr) are passed through pipes.

Obviously, this is a pain in the ass if you want to upload stuff to Sourceforge automatically, as the only solution for automated logins is to use ssh key files. Uploading ssh keys to Sourceforge, however, is not possible, thus, I strived for another program providing better functionality.

Being an ex-Windows-user, PuTTY promptly came to mind. psftp provides nice functionality: it is able to read the password as a parameter and even hides it in the process list output (I guess it must have forked to achieve that). Now, the familiar unix style known from automating ftp sessions can be used to script it:

psftp -pw password <<EOF
cd /some/folder
put some.file

Dismantling and repairing a Ford radio 4000 RDS or the smashed rotary button

Question: What can you do if, in an accident, someone smashed the volume “button” of a Ford radio, and the volume doesn’t work anymore?

Ford Radio key Ford Radio - repaired front pcb


  • Release the radio (without a professional tool for that, the two metal things in the picture were made out of a flower holding facility, the diameter is 2 mm or 2.5 mm with the white lacquer, that still fits)
  • Remove the front panel with a Torx-8 screwdriver
  • Look concerned because the PCB is completely wracked where the rotary encoder was
  • Take a fitting piece of prototype PCB, drill and solder in the rotary encoder
  • Swear because there is no room to put the prototype PCB behind the PCB of the front panel (the rotary encoder and so the prototype PCB will receive all the force from pressing it, so it needs to be mounted solidly)
  • Dismantle the PCB with a Torx-6 screwdriver, put the prototype PCB behind it, mount it again
  • Wire up everything and have fun with a working volume control