I can haz success! Unison hack to enable Unicode normalization of filenames

NOTE: The latest development version of Unison now has built-in Unicode support. Check this post for how to compile and use it!
DISCLAIMER: This is a very ugly hack! It’s been tested to work in MY setup, but might not work in yours. I really don’t know OCaml, or makefiles for that matter. You have been warned!

After much agony I’ve finally managed to build a hacked version of Unison to make my file sync setup work. The problem, as explained earlier, is that Unison doesn’t support Unicode, and that I have to synchronize files between Mac OSX machines (using UTF8 NFD-normalized filenames) and Windows machines (using latin1 or UTF8 NFKC-normalized filenames). To make filenames containing non-ASCII characters transfer correctly, some kind of conversion has to be made, and as of now Unison does not support this.

In my file sync setup, I have three OSX machines synchronizing files using a Windows server as the central node (all OSX machines sync with the Windows machine). Synchronization is always initiated from one of the OSX-machines. What I have done is to install Cygwin on the Windows machine, and also install a hack for Cygwin which enables UTF8 support.

When I first did this I thought it would be enough, but since Windows/Cygwin and OSX use different Unicode normalization forms (NFKC and NFD), the bit-by-bit representations of the filenames are different. This is what I set out to fix. I have inserted a few lines of code in the function that preprocesses filenames before comparison is done in Unison. Those lines use the Camomile Unicode library to normalize the filename to NFKC, so when the OSX and Windows filenames are compared a little bit later they will be bit-wise identical.
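To make the idea concrete, here is a minimal sketch in Python rather than OCaml. It is not the code from the hack: it uses Python's standard `unicodedata` module instead of Camomile, and the function names `normalize_name` and `same_file_name` are made up for illustration.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Return the NFKC form of a filename (the form the hack normalizes to)."""
    return unicodedata.normalize("NFKC", name)


def same_file_name(a: str, b: str) -> bool:
    """Compare two filenames after normalization instead of code point by code point."""
    return normalize_name(a) == normalize_name(b)


# The same visible name, once composed and once decomposed
# (base letter followed by a combining mark).
composed = "r\u00e4ksm\u00f6rg\u00e5s.txt"
decomposed = "ra\u0308ksmo\u0308rga\u030as.txt"

print(composed == decomposed)                # plain comparison
print(same_file_name(composed, decomposed))  # comparison after normalization
```

One thing to keep in mind: NFKC also folds compatibility characters (a ligature like U+FB01 becomes plain "fi"), so two names that really are different could end up equal. NFC would avoid that particular side effect.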

This is DEFINITELY not the best way to do this, and it by no means fixes all of Unison’s encoding problems. What one should do is rewrite all of the filename handling to support Unicode and other encodings as well. But I don’t know OCaml very well, in fact I find it quite confusing and frustrating, so for the moment this will have to do for me.

And it seems this is enough to fix my problems. The hack only needs to be applied to the OSX-side of Unison to work, even though it would probably be better if it was applied to both sides (but I’m WAY too lazy to try to compile Unison in Cygwin if it seems I don’t have to :P ).

So, if anyone needs to sync an OSX machine with a Windows machine, or perhaps with a Linux machine with a UTF8 filesystem, this could perhaps be of some help to you. (Note that while OSX and Windows/Cygwin enforce NFD and NFKC respectively, Linux does NOT. So in Linux it would be possible to have two different files with seemingly identical names, but with different normalization. This would obviously not work well with this hack, but that would probably be a less than ideal situation anyway.)
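If you want to check a Linux directory for that situation before syncing, something along these lines could work. This is only a sketch: the helper name `find_normalization_collisions` is made up, and it assumes the names have already been decoded to strings (for example by `os.listdir`).

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names):
    """Group names that become identical after NFKC normalization."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFKC", name)].append(name)
    # Only groups with more than one distinct original name are a problem.
    return {key: members for key, members in groups.items() if len(set(members)) > 1}


# Two spellings of the same visible name plus one unrelated file.
listing = ["r\u00e4k.txt", "ra\u0308k.txt", "plain.txt"]
for normalized, members in find_normalization_collisions(listing).items():
    print(normalized, "<-", [m.encode("utf-8") for m in members])
```

For a real directory you would pass in `os.listdir(some_path)` instead of the hard-coded list.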

Quick install:

This is the quick install for people who don’t want to compile stuff.

  1. Download my precompiled (OSX Leopard) Unison binary here: unison-unicode.zip (600KB, based on Unison 2.27). You only need the modified binary on the OSX side (as long as synchronization is initiated from that side), but all other machines must use the same version of Unison (2.27).
  2. Download the Camomile data files (5MB). These files must be extracted into /usr/local/share/camomile on your OSX machine (hardcoded, sorry!).

Build yourself:

These are instructions for how to build the modified Unison version yourself (for OSX, but might work on other architectures as well):

  1. Download and install OCaml.
  2. Download and install/build Camomile (follow instructions and use the default installation directory).
  3. Check out a version of Unison with Subversion (I’m using /branches/2.27, but I think it will work with the latest beta version as well).
  4. Replace the files src/case.ml and src/Makefile.OCaml with these files.
  5. Compile using “make UISTYLE=text”.
  6. The new Unison binary will be at src/unison. I would recommend you rename it to unison-unicode or something to tell it apart from your regular Unison version.

Your modified binary (from either the quick or full install) will enable you to synchronize files with Unicode filenames between an OSX machine and another machine with a UTF8 filesystem (for example Linux). If you want to sync with Windows you need to install Cygwin (make sure to select the unison package during installation) and the Cygwin UTF8 hack as well. Make sure it’s the Cygwin Unison binary that is being used during synchronization by passing the parameter “-servercmd /usr/bin/unison”.

Note that this version of Unison requires that the two file systems being synchronized are UTF8. If it encounters a filename that is not valid UTF8 it will probably crash!
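To find out beforehand whether a directory tree contains such names, a small check like the following might help. It is a sketch, not part of the hack; `is_valid_utf8` and `find_invalid_names` are names invented here, and the tree is walked as raw bytes so that nothing gets decoded behind your back.

```python
import os


def is_valid_utf8(raw_name: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        raw_name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_invalid_names(root: str):
    """Walk root and return paths (as bytes) whose file or directory name is not valid UTF-8."""
    bad = []
    # Passing bytes to os.walk makes it yield bytes names as well.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                bad.append(os.path.join(dirpath, name))
    return bad
```

Calling `find_invalid_names` on the folder you are about to sync should return an empty list if everything is fine. A latin1-encoded name such as `b"r\xe4k.txt"` would be reported.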

If anyone actually tries this, please post your comments below! Thanks ;)


Unison Unicode problems

Unison is a pretty awesome file synchronizing utility. It’s free, open source, highly customizable and scriptable. It does, however, have one big flaw: it doesn’t support Unicode. As long as you synchronize between file systems of identical encoding, it doesn’t matter. Unfortunately, Windows, Linux and MacOSX all use different encodings by default.

My setup synchronizes files between 3 different OSX machines using a Windows server as the central node. File names containing non-ASCII characters like ÅÄÖ get messed up when transferred, e.g. the OSX file räksmörgås.txt will appear as räksmörgaÌŠs.txt on the Windows machine.
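The garbled “aÌŠ” can be reproduced in a few lines of Python. This is a sketch under one assumption that is not from my actual setup: that the UTF8 bytes get interpreted as Windows-1252 (cp1252), the Windows flavour of latin1.

```python
import unicodedata

name = "r\u00e4ksm\u00f6rg\u00e5s.txt"         # räksmörgås.txt, composed form
nfd_name = unicodedata.normalize("NFD", name)  # decomposed, like OSX stores it

utf8_bytes = nfd_name.encode("utf-8")

# Assumption: the receiving side reads the bytes as Windows-1252.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# The combining ring U+030A is the two bytes CC 8A in UTF8,
# and cp1252 shows those as "Ì" and "Š".
print("\u030a".encode("utf-8").hex(" "))
```

So the å is sent as a plain “a” followed by a two-byte combining mark, and those two bytes show up as two separate junk characters.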

This is very annoying. I really like my synchronization setup, and this is the only problem I have with it. What to do? Windows uses latin1 encoding for file names, and OSX uses utf8. What if you could trick Windows into using utf8 also? Linux supports utf8 file names, so maybe Cygwin can help. Nope, turns out Cygwin does not support Unicode… Googled “cygwin unicode” and found a hack to Cygwin which enables Unicode and utf8 support for file names. My hope was rising as räksmörgås.txt seemed to correctly appear on the Windows side. Yes, I had done it! Ran Unison again to double check, and the file was now for some reason flagged as new on the Windows side, and the whole operation failed when Unison tried to copy the file back to the OSX side and discovered that the file was already there.

So, it turns out that there is such a thing as Unicode Normalization. Short story: The same character can be represented in different ways in Unicode, namely composed or decomposed. And, to make matters worse, OSX uses the decomposed form (NFD), and Windows/hacked Cygwin uses the composed form (NFKC). So even though the file is called räksmörgås.txt on both machines, the exact bit representation of the name is different. If I had used a Unicode-aware program, this wouldn’t have been a problem and the file names would have been recognized as identical. But as I said, Unison is NOT such a program…
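For the curious, this is what the difference looks like when you print the two forms side by side. It is a small Python sketch using the standard `unicodedata` module. NFC is used for the composed form here because it is the plain composed counterpart of NFD; for this particular name NFKC gives the same result.

```python
import unicodedata

name = "r\u00e4ksm\u00f6rg\u00e5s.txt"  # räksmörgås.txt

nfc = unicodedata.normalize("NFC", name)  # composed: one code point per å, ä, ö
nfd = unicodedata.normalize("NFD", name)  # decomposed: base letter + combining mark

for label, form in (("NFC", nfc), ("NFD", nfd)):
    encoded = form.encode("utf-8")
    print(label, len(form), "code points,", len(encoded), "bytes:", encoded.hex(" "))

# They look the same on screen but do not compare equal...
print(nfc == nfd)
# ...unless both sides are brought to the same form first.
print(unicodedata.normalize("NFC", nfd) == nfc)
```

The composed name is 14 code points (17 bytes in UTF8), while the decomposed one is 17 code points (20 bytes), which is why a byte-wise comparison treats them as two different files.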

I’ve done some research (i.e., googled) and there don’t seem to be any plans to incorporate Unicode support in Unison. It turns out Unison is written in OCaml, which doesn’t natively support Unicode, so adding support for this would, according to Unison’s developers, be pretty hard.

But how hard can it really be? I just need to make sure that both filenames are normalized before they are compared. And there are third party libraries to enable Unicode support in OCaml. So I went off and downloaded the Unison source code, the OCaml binaries, and the Unicode library (Camomile). It was pretty easy to locate the piece of code where the normalization should, or at least could, be done. Only one problem remains: Camomile is very poorly documented, and comes with absolutely no example code! Right, two problems: OCaml is a functional language (like Haskell), and it turns out I hate functional languages!

To be continued (hopefully)…

UPDATE: Problem kind of solved!


Hosting scare-of-the-day (eVerity)

Short story… My email went down. Checked my websites hosted on the same server; also down. Checked hosting company’s homepage; also down (or, only showing a status message as seen to the right).

So far so good, they seem to be working on it. Clicked on the LIVE HELP link and asked them to confirm that email, and not just mysql, was down for the moment. Got reply:

“These problems started when we restored a backup of YOUR site. Hacking is a crime! You need to be getting yourself a new host asap!”

At about the same time all my sites started to show an “account suspended” message. Ehm… Well, I don’t remember hacking my own server, and honestly I didn’t even know I could hack. I must say this gave me quite a scare, since I thought I might be losing all the gigs of email I keep on their servers.

So, it turns out they had confused me with someone else (some guy’s account got suspended, and he used a friend’s account on the same server to get back at the hosting company). Sites are up, and email is up save some DNS problems. All’s well that ends well, but really, this is not the kind of greeting you’d like when you inquire about why your email is down :/


Dreamhost Backup problems; stay away from them!

So it seems Dreamhost have once again lost all my files on their backup server, or at least I’ve lost access to them… This whole ordeal started more than three weeks ago, though I didn’t notice it at first. At one point they sent out an email explaining and apologizing for what had happened. So far so good, accidents can happen. But the email also said that it was now OK to reupload the files, and a few days later they seemed to be gone again.

Now it’s been more than a week since the last status update from Dreamhost, and still the backup server seems to be going up and down all the time. No word on what’s going on, or when I can plan to start backing up my data again. This is extremely bad! If they would just give me a status update I’d be prepared to give them some more time to get their house in order, but they keep silent. I’d have to recommend everyone to stay away from Dreamhost! A backup service that loses the data of all its customers should be a bad enough sign already, but when they even fail to communicate properly about what’s going on, that’s just the last straw. I’ll be asking for a refund, and thank my lucky star I didn’t host this blog with them…

UPDATE: Got a response from Dreamhost with apologies, and assurances that the backup server should now be fully functional again. We’ll see…


My backup host is psychic! Or maybe not…

Two months ago I got myself a 50GB backup account for personal use (photo library, documents, code and such), and played around with rsync until I got a working backup script. However, for some deranged reason, I never actually got around to scheduling the backup script to run every night… Well, now I felt it was time to remedy the situation, and maybe write another blog-post while I was at it.

At the very same second I decided this, an email arrived from my backup hosting company (Dreamhost) with the ominous subject “Backup server problems”..! Now, honestly, what is that? Fate? Can they read my mind or what? This is seriously creepy…!

Hello Albin,

I’m very, very sorry it has taken us so long to report back to you regarding your backup user status.  For the last week and a half we’ve been working with 3Ware to try and revive the raid array, but so far have still had no success. You can read more about it here:

http://www.dreamhoststatus.com/2009/03/07/backup-server-problems/

At the moment, it is really starting to look bad, and we have more or less given up hope on being able to recover the data. Again, I’m really horribly sorry about this.

Hopefully, as this loss only affected our “backup” service (intended to only be used as your backups), you have another copy of all the data that was there. We have now at least gotten the backup server back online (without any 3Ware raid cards this time), so you may begin using it again to re-backup your data.

To try and make up for this a tiny bit, I have just now applied a $20 account credit to your account. (The total of any amount you’ve been charged for backups since October, plus  extra.) Again, I sincerely apologize for the inconvenience. If you have any questions, please respond to this email and we will all do our best to help you in any way we can.

Sincerely,

So, it turns out that they had begun experiencing problems with their backup server about two weeks ago. Since my backup script hadn’t been running I hadn’t even noticed. But get this, in the end they were unable to recover the data, and gave up! All data on the backup server gone. For everyone! And yes, when I logged in now my backed up files were nowhere to be seen.

So.. Well for me it’s actually not a big deal, since all my data is safe and sound on my healthy 1.8 TB RAID5 array at home, and my backup script hadn’t been running anyway… And I guess a $20 credit is better than nothing. But honestly, this must really be the last kind of email you’d like to get from your backup company…

(Dreamhost isn’t actually a backup company, but a regular hosting company. Since last summer they’ve been offering an additional 50GB backup space with every hosting account, which as it turns out is cheaper than most of the regular dedicated backup providers. But I guess this is why :P )

So, well did I ever get around to setting up the backup schedule? Yes I did! My home server is running WinXP and Cygwin, with rsync and cron. The command I use for rsync is:

rsync --delete --protocol=29 -avP /cygdrive/d/backup-folder bXXXXXX@backup.dreamhost.com:~/backup/

My local backup-folder is actually a collection of symlinks (yes, you can use symlinks in Windows, but they are called junctions) pointing to the folders I actually want to back up. As you can see, rsync is not so difficult to use (the protocol=29 stuff is Dreamhost specific, my thanks go to Climens Codelog for the tip).

After some meddling with crontab (turns out crontab and emacs don’t play nice, at least not on Cygwin), I got the schedule up and running as well with:

0 4 * * * rsync --delete --protocol=29 -avP /cygdrive/d/backup-folder bXXXXXX@backup.dreamhost.com:~/backup/ >> ./backup.log

Voilà! Automatic backup scheduled to run every night at 04:00. Finally. Just hope that Dreamhost won’t mess up again…
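The crontab line above only appends rsync’s normal output to the log, since “>>” does not catch error messages. If you would rather have everything in one place, with a timestamp per run, a small wrapper script is one option. The sketch below is hypothetical and not what I actually run: `build_rsync_command` and `run_backup` are made-up names, and only the rsync flags are taken from the command above.

```python
import datetime
import subprocess


def build_rsync_command(source, destination, protocol=29):
    """Assemble the same rsync call as in the crontab line, as an argument list."""
    return ["rsync", "--delete", f"--protocol={protocol}", "-avP", source, destination]


def run_backup(source, destination, log_path):
    """Run rsync and append both its output and its errors to log_path."""
    command = build_rsync_command(source, destination)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"--- backup started {datetime.datetime.now().isoformat()} ---\n")
        log.flush()  # make sure the header lands before rsync starts writing
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

Cron would then call the script instead of rsync directly, and a non-zero return value from `run_backup` tells you that rsync reported a problem.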

UPDATE: Dreamhost messed up again! The files I uploaded yesterday are gone! Nothing new in the Dreamhost Status blog… What’s going on?

UPDATE 2: My files have mysteriously reappeared. Still no news from Dreamhost on what’s going on…


Thunderbird 3 Beta 2

I know I said that I would switch to Mail.app, but somehow I still use Thunderbird from time to time… Anyhow, Thunderbird 3 beta 2 was released a few weeks back, so for those who thought that beta 1 sounded too unstable, maybe now it’s time to give Thunderbird 3 a try? As I’ve said before, for me the main reason for updating is that Thunderbird 3 natively supports the OSX Address Book!


3-col vertical view in Apple Mail (OSX 10.5)

When I made the switch to Mac and OSX, for some reason I stayed reluctant to trade my old trusty Thunderbird for OSX’ native Mail.app. But unfortunately Thunderbird doesn’t support the pesky Outlook calendar invitations my company always spams me with, and I heard a rumor that Mail.app does, so I thought I had to give it a try after all.

It turns out, however, that I really, really, really miss the three column vertical view from Thunderbird (and Outlook for that matter). For some strange reason Mail.app sports the older 2 column split mode… Fortunately, as always, Google has the answer. I quickly found WideMail by Dane Harnett, a modification to Mail.app that enables a three column view just like in Thunderbird. It also has many options to further customize the appearance. Thanks Dane!


Two iCal feeds you need

iCal in OSX makes it easy to subscribe to external calendars, and two such feeds you will definitely want to have are holidays, and (at least if you live in Sweden) week numbers! However, it turns out that most Swedish holiday calendars over at icalshare.com were pretty crappy, as none of them seemed to have been updated for 2009 yet…

But luckily I stumbled upon this nice feed generator over at Oops.se! They have customizable holiday feeds for Finland, England, Denmark, France, Sweden, US and Norway, as well as week number feeds for Sweden and US. Best thing is, they are automatically updated so next year you will automatically get next year’s holidays! (Remember to set iCal’s auto-refresh feature to Weekly.)

(For the lazy Swedes out there, here is the current Swedish holidays iCal feed as well as the week number iCal feed.)


Thunderbird 3 beta 1 supports native OSX Address Book integration

Well, it has been out for a while, but I just noticed that Thunderbird 3 beta 1 actually supports native OSX Address Book integration without any hacks! Just open the Address Book in Thunderbird and activate “Use Mac OS X Address Book” in the File menu. Let’s just hope beta 1 is stable enough for everyday use. I’ll be using it from now on anyway, so we’ll see!


Syncing OSX Stickies

I’ve been missing the ability to sync notes since I abandoned my last Palm… I used the Palm/Outlook synchronized notes for everything from remembering song lyrics to keeping track of how much money my friends borrowed to buy ice cream.

In OSX there is a bundled program called Stickies which displays lots of colored post-it-like notes on the screen. It’s not really what I want since you have to have all your notes on screen at the same time (if you close one, it counts as deleting it!), but I guess it will do for now. At least as long as I can sync the notes between my OSX machines!

And, it turns out it’s pretty simple. The notes are stored in ~/Library/StickiesDatabase, so all I needed to do was to add this file to my OSX-prefs Unison sync profile! Of course, you can’t go around editing your notes on two computers at the same time without syncing, but most of the time I’m only using one computer at a time, so it shouldn’t be that much of a problem.
