
Introduction

Bomstrip is a very simple tool that removes BOMs (byte-order marks) from utf-8 files. Actually, it is a set of tools that all do the same thing, but - for added entertainment value - in multiple programming languages (python, c, java, brainfuck, ook!, perl (twice), sed, postscript, pascal, unlambda, limbo, haskell, ocaml, php, ruby, c++, forth, awk). You want to always have this tool within reach, no matter where you are and which compilers/interpreters you keep close to you.

Each tool reads from stdin and writes to stdout. It accepts no options or arguments, and it never writes into files directly. All files are public domain. The whole thing exists to point out how stupid BOMs in utf-8 files are.

Oh, in case you didn't know yet: utf-8 does not have byte-ordering issues, so there is absolutely no need to have three bytes (the utf-8-BOM) that do not say anything about the byte-order (since there is nothing to say).
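For the doubters, here is a small python sketch of the point above. It is only an illustration and not part of the bomstrip tarball; the variable names are made up. It encodes the character U+FEFF three ways: the two utf-16 flavours come out with their bytes swapped, which is exactly what a byte-order mark is for, while utf-8 only has one possible answer.

```python
# Illustration only: encode U+FEFF in three encodings and compare.
BOM_CHAR = "\ufeff"

utf16_be = BOM_CHAR.encode("utf-16-be")  # big-endian byte order
utf16_le = BOM_CHAR.encode("utf-16-le")  # little-endian byte order
utf8 = BOM_CHAR.encode("utf-8")          # no byte order to choose from

print(utf16_be.hex())  # feff
print(utf16_le.hex())  # fffe
print(utf8.hex())      # efbbbf
```

In utf-16 the two byte sequences tell a reader which order was used. In utf-8 the three bytes are always the same, so they tell a reader nothing.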

I want it!

Wow, you are impatient! But you're lucky! You can have it! It's free! Get the latest version now: bomstrip-9.tgz. YEAH.

Recognizing a utf-8 BOM

The utf-8 BOM can be found at the start of some files. It consists of three bytes: EF BB BF. This is the utf-8 encoding of the unicode character U+FEFF.
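The check described above fits in a few lines. What follows is a minimal python sketch of the idea, not the python implementation shipped in the tarball; the names UTF8_BOM, strip_bom and main are invented for this example.

```python
import sys

# The three bytes described above: the utf-8 encoding of U+FEFF.
UTF8_BOM = b"\xef\xbb\xbf"


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading utf-8 BOM, if there is one."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


def main() -> None:
    # Work on raw bytes so nothing is decoded or re-encoded on the way
    # from stdin to stdout.
    sys.stdout.buffer.write(strip_bom(sys.stdin.buffer.read()))
```

Only a BOM at the very start of the input is removed; the same three bytes anywhere else are left alone. To use the sketch as a filter, call main() at the end of the script and feed it a file on stdin.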

Reasons for a BOM in utf-8 encoded files

Reasons for not having a BOM in utf-8 encoded files

Why do people do this?!

Honestly, I don't really know. This is one of those mysteries that might never get solved. Oh, there is one lead: it seems to be generated mostly (exclusively?) by Windows systems. Really, who would have thought?

How can I help?

Of course you want to help in the noble quest of removing all the utf-8 BOMs around. WE NEED YOUR HELP! Write bomstrip in your favorite language and send it to me at <mechiel@ueber.net> for inclusion in the next version. We still need implementations in the following languages: c#, whitespace, prolog, shakespeare, lisp, erlang, lua, tcl, visual basic and so many more.

Disclaimer

I do not guarantee that this program strips BOMs. I do not guarantee that this program does anything at all. If this program does or does not do something to you or your files that you do not or do want, I cannot be held responsible. Okay, that feels much safer.

News

24-06-2008 - bomstrip-9.tgz
Fresh implementations by Peter Pentchev! Bomstrip-9 now comes with another perl implementation, a one-liner. And a c++ implementation, and a forth implementation, and an awk implementation (well, a crippled one, since it does not run on the one true awk). Peter Pentchev also contributed some improvements to the c & python implementations. And changed the test script to make testing easier. Many thanks!
11-02-2008 - bomstrip-8.tgz
After some time of inactivity, a new version thanks to Andrew Gerrand! Many thanks indeed for his version of bomstrip in PHP! As an added bonus, I've thrown in a little ruby implementation. Keep them coming!
18-09-2005 - bomstrip-7.tgz
Second release today! Just created an ocaml implementation. Enjoy!
18-09-2005 - bomstrip-6.tgz
Wow, we're on a roll. Today brings implementations in limbo (nice language) and in haskell. Both by yours truly.
17-09-2005 - bomstrip-5.tgz
Now with implementation in unlambda by Matthijs Bomhoff. Thanks a lot! This is getting more impressive each release. But remember, we are not there yet. More!
10-09-2005 - bomstrip-4.tgz
New implementations in Postscript and Pascal. Thanks to Berteun Damman. Great! Keep them coming!
07-09-2005 - bomstrip-3.tgz
New release. The previous java version has been replaced by one that is more java-style (not the C rewrite it was in version 2). Thanks go to Ruben Smelik for java-ifying!
06-09-2005 - bomstrip-2.tgz
Second release. Now with implementation in sed (thanks Andreas Gohr), java (by me), brainfuck (thanks Berteun Damman; run it with interpreters bff or nbfc or another interpreter that reads -1 at EOF), perl (thanks Matthijs Bomhoff) and ook! (thanks Berteun Damman). Enjoy the ride.
06-07-2005 - bomstrip-1.tgz
First beta-pre-alpha release! Unfortunately, I'm too lazy to make a sourceforge account, CVS repository, mailing lists, issue trackers, freshmeat announcements, precompiled binaries for all linux distributions, packages for debian, gentoo, *bsd and all the others. In short, this project is not yet as cool as it could and should be.

Paranoia

sha1(bomstrip-9.tgz): 70c8b03df90e66c745fe9b5b5ff6790a0ecd32a1
md5(bomstrip-9.tgz): 93184de71a25831fa03ec49f0bca3e34
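
For the truly paranoid, a python sketch for comparing a downloaded tarball against the two checksums above. The function names and the assumption that the file sits in the current directory as bomstrip-9.tgz are mine; only the hash values come from this page.

```python
import hashlib

# Checksums as published above for bomstrip-9.tgz.
EXPECTED_SHA1 = "70c8b03df90e66c745fe9b5b5ff6790a0ecd32a1"
EXPECTED_MD5 = "93184de71a25831fa03ec49f0bca3e34"


def file_digest(path, algorithm):
    """Return the hex digest of the file at path for the named algorithm."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so a large file never has to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path):
    """Return True only if both the sha1 and the md5 of the file match."""
    return (file_digest(path, "sha1") == EXPECTED_SHA1
            and file_digest(path, "md5") == EXPECTED_MD5)
```

Calling verify("bomstrip-9.tgz") returns True only when both digests match the published values.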