Word Frequency Counting in Haskell

As part of the CS240H class at Stanford, we were asked to write a simple program to count word frequencies in some input text. The assignment was designed simply to demonstrate the fundamentals of the language and to give us a reason to learn them.

The requirements themselves were pretty straightforward. Take in UTF-8 data from STDIN or from a list of files provided as arguments. Count all the words, lowercased and with punctuation removed other than apostrophes, and then print an ASCII-art histogram. Pretty simple, right?

This problem actually embodied one of the fundamental performance issues I've been struggling with in Haskell: maintaining mutable state using immutable data. I originally started with a complex state-transformer monad on top of IO, but eventually I found that the code was cleaner if I separated the code for reading in data and printing it out from the actual counting code.

Instead of doing anything fancy with Iterators and Enumerators I just presented a lazy list of words to the counter, which then fed the resulting map as a list to the histogram printer. But the interesting part is building up that list. The map will be heavily modified, and I found that the program spent as much as 33% of its time on garbage collection while looping over the words. Every word needs to be normalized, duplicating storage, and the state gets updated many, many times. I still haven't found a way to make a state loop as tight as it can be in an imperative language, but I'm sure I'll figure it out soon.

After all, I now have some real pros on my side.
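For a feel of the pipeline described above (normalize each word, count, then print a histogram), here is a minimal sketch. It is written in Python rather than the Haskell used for the assignment, and it is not the WordFreq code. The names normalize, count_words, and histogram are made up for illustration.

```python
from collections import Counter


def normalize(token):
    """Lowercase a token and keep only letters, digits, and apostrophes."""
    return "".join(ch for ch in token.lower() if ch.isalnum() or ch == "'")


def count_words(lines):
    """Count normalized words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            word = normalize(token)
            if word:  # skip tokens that were pure punctuation
                counts[word] += 1
    return counts


def histogram(counts, width=40):
    """Return one text row per word, most frequent first, bars scaled to `width`."""
    if not counts:
        return []
    longest = max(len(word) for word in counts)
    top = max(counts.values())
    rows = []
    for word, n in counts.most_common():
        bar = "#" * max(1, n * width // top)
        rows.append(f"{word.ljust(longest)} {bar}")
    return rows


# Small in-memory example; a real program would read STDIN or files instead.
sample = ["The cat's hat.", "A hat? The hat!"]
for row in histogram(count_words(sample), width=10):
    print(row)
```

Keeping the counting step separate from reading and printing mirrors the split described above: count_words only needs something it can iterate over, so it does not care whether the lines come from a file, STDIN, or a test.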

WordFreq - a simple word frequency counter

WoWSim: A World of Warcraft combat simulator

WoWSim is one of my projects that is a work in progress.  It is not complete, though the initial concept is there.

The system is built on DisEvSim. I had previously written a simulator that used time intervals for simulation, but I found it to be slow and unwieldy. In an effort to improve performance for sparse events like WoW combat (a few events per second instead of 32 ticks per second) I rewrote it with this framework. It is much faster, but I haven't completed all the necessary features.

Some day I may come back to this work, but for now it shall stand as my largest Haskell project yet.

DisEvSim: A simple discrete event simulator in Haskell.

When I was working on my World of Warcraft combat simulation, I first tried a fixed-time simulation, but I found that to be horribly inefficient, at least at the low number of events I was working with.  So in my second version I endeavored to use a discrete event simulator, which simulates only events at specific times.

To this end I created DisEvSim.  It is a simple discrete event simulator that utilizes an event queue along with event handlers to perform a simulation.  It is intentionally simple as I hoped to keep it general.  Performance may need a little work, but it was fast enough for my purposes.
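The general shape of an event queue plus event handlers can be sketched in a few lines. This is a Python illustration of the idea only, not DisEvSim's Haskell API; the Simulator class, its on/schedule/run methods, and the "swing" event are all hypothetical names.

```python
import heapq
import itertools


class Simulator:
    """A tiny discrete event loop: time jumps straight from one event to the next."""

    def __init__(self):
        self.now = 0.0
        self._queue = []                 # heap of (time, order, kind, payload)
        self._order = itertools.count()  # tie-breaker: equal times run in scheduling order
        self._handlers = {}              # event kind -> list of handler functions

    def on(self, kind, handler):
        """Register a handler called as handler(sim, payload) for events of `kind`."""
        self._handlers.setdefault(kind, []).append(handler)

    def schedule(self, delay, kind, payload=None):
        """Queue an event `delay` time units after the current simulation time."""
        heapq.heappush(self._queue, (self.now + delay, next(self._order), kind, payload))

    def run(self, until):
        """Process events in time order until the queue is empty or `until` is passed."""
        while self._queue and self._queue[0][0] <= until:
            self.now, _, kind, payload = heapq.heappop(self._queue)
            for handler in self._handlers.get(kind, []):
                handler(self, payload)


# Example: an attack that lands and then reschedules itself every 2.5 seconds.
log = []


def swing(sim, payload):
    log.append((sim.now, payload))
    sim.schedule(2.5, "swing", payload)


sim = Simulator()
sim.on("swing", swing)
sim.schedule(0.0, "swing", "auto-attack")
sim.run(until=10.0)
```

Nothing happens between events, so ten simulated seconds cost five handler calls here instead of hundreds of fixed ticks, which is the saving described above for sparse events.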

Unfortunately, I abandoned the World of Warcraft combat simulation project as my interest in the game waned and other work presented itself.  However, I intend to maintain this package for future simulations.

Colorizing the GHCi prompt

A few days ago, I realized GHCi output would be a lot easier to read if I added color to the prompt, making it easy to see what output is part of the last command and what is from the command before. It took a little googling, but I found this example ghci.conf which set things properly. The trick was generating the right characters, since it uses escape codes for colors, like bash, but does not interpret them in the string the same way bash does for PS1.

-- generated with 
--   echo -e :set prompt '"\033[32;1m%s\033[34;1m>\033[0m "' > ~/.ghc/ghci.conf
-- see
--   http://en.wikipedia.org/wiki/ANSI_escape_code--Colors
-- for color info, and
-- - http://martijn.van.steenbergen.nl/journal/2010/02/27/colors-in-ghci/
-- - http://www.haskell.org/haskellwiki/GHCi_in_colour
-- for ways to color other parts of ghci output (but not the prompt :P)
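-- (the raw ESC bytes that echo -e writes are shown here as \ESC, which GHCi also accepts in a quoted prompt string)
:set prompt "\ESC[32;1m%s\ESC[34;1m>\ESC[0m "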

Basically, you will want to run echo -e :set prompt '"\033[32;1m%s\033[34;1m>\033[0m "' > ~/.ghc/ghci.conf from the prompt to add the correct characters to the file. After that, just restart GHCi and you should have a nice green and blue prompt.
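To make "the correct characters" concrete: what has to end up in the file is the single ESC byte (decimal 27) in front of each color code, which is what echo -e produces from \033. A small Python sketch that builds the same line is below; the function name ghci_prompt_line and its parameters are made up for illustration.

```python
ESC = "\x1b"  # the single byte that echo -e produces for \033


def ghci_prompt_line(name_color="32;1", arrow_color="34;1"):
    """Build a ':set prompt' line with raw ESC bytes embedded in the string."""
    return f':set prompt "{ESC}[{name_color}m%s{ESC}[{arrow_color}m>{ESC}[0m "'


line = ghci_prompt_line()
print(repr(line))  # repr makes the otherwise invisible ESC bytes visible as \x1b
```

Writing that line to ~/.ghc/ghci.conf would have the same effect as the echo command; it is left out here so the sketch does not touch your configuration.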

Using Stored Procedures in Yii

Stored procedures can be very useful. They seem to have three major uses: improving performance, increasing security, and maintaining consistency. However, the movement to more abstract data-access layers has made it harder to fully utilize the database. These libraries go a long way towards establishing some best practices, and are a huge step forward from hand-crafting SQL, but they tend to make it difficult to do anything outside of what they were originally designed around.

Background and motivation

I recently came across a problem using the MySQL Geometry type with Yii and ActiveRecord. I have a table of locations, one of which will be the user’s location, and I want to be able to display items on the home page ordered by distance from the user. It’s a simple spatial lookup. I pre-populated the database with US cities, 141989 of them, for testing.

mysql> desc location;
+----------------+--------------+------+-----+---------+----------------+
| Field          | Type         | Null | Key | Default | Extra          |
+----------------+--------------+------+-----+---------+----------------+
| id             | int(11)      | NO   | PRI | NULL    | auto_increment |
| address        | varchar(255) | YES  |     | NULL    |                |
| longitude      | float        | YES  | MUL | NULL    |                |
| latitude       | float        | YES  | MUL | NULL    |                |
| city           | varchar(255) | YES  | MUL | NULL    |                |
| state          | varchar(255) | YES  | MUL | NULL    |                |
| country        | varchar(255) | YES  | MUL | NULL    |                |
| postal_code    | varchar(255) | YES  | MUL | NULL    |                |
+----------------+--------------+------+-----+---------+----------------+

I can use some simple functions to actually compute the distance. For more information, refer to Alexander Rubin’s talk on the subject. Here is a function for computing the distance in miles between two latitude/longitude pairs (3956 is approximately the Earth’s radius in miles).

CREATE FUNCTION geoDist(lat1 float, long1 float, lat2 float, long2 float) RETURNS float DETERMINISTIC
    RETURN 3956 * 2 * ASIN(SQRT( POWER(SIN((lat1 - lat2) * pi()/180 / 2), 2) + COS(lat1 * pi()/180) * COS(lat2 * pi()/180) * POWER(SIN((long1 - long2) * pi()/180 / 2), 2) ));
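The same haversine formula is easier to read, and to sanity check, outside of SQL. The following is a Python transcription of the function above, assuming the same 3956-mile radius. The name geo_dist simply mirrors geoDist and is not part of the application.

```python
import math

EARTH_RADIUS_MILES = 3956  # same constant as the SQL function


def geo_dist(lat1, long1, lat2, long2):
    """Great-circle distance in miles between two points given in degrees."""
    half_dlat = math.radians(lat1 - lat2) / 2
    half_dlong = math.radians(long1 - long2) / 2
    a = (math.sin(half_dlat) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(half_dlong) ** 2)
    return EARTH_RADIUS_MILES * 2 * math.asin(math.sqrt(a))


# One degree of latitude works out to roughly 69 miles.
print(round(geo_dist(0.0, 0.0, 1.0, 0.0), 2))
```

That figure of about 69 miles per degree of latitude is where the constant 69 in the bounding queries comes from.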

Running a query using this we get:

mysql> select count(target.id) from location origin, location target where geoDist(origin.latitude, origin.longitude, target.latitude, target.longitude) < 10 and origin.city = 'San Francisco' and origin.state = 'CA';
+------------------+
| count(target.id) |
+------------------+
|               50 |
+------------------+
1 row in set (1.99 sec)

Yikes, a few seconds on every page load? That won’t do. Once we add street addresses and international locations, the table will grow fairly quickly. Well, we really should bound that query so we aren’t looking at every value.

mysql> select count(target.id), geoDist(target.latitude, target.longitude, origin.latitude, origin.longitude) as distance from location origin, location target where geoDist(origin.latitude, origin.longitude, target.latitude, target.longitude) < 10 and origin.city = 'San Francisco' and origin.state = 'CA' and (target.latitude between (origin.latitude - 15 / 69) and (origin.latitude + 15/69)) and (target.longitude between (origin.longitude - 15/69) and origin.longitude + 15/69) having distance < 10;
+------------------+-----------------+
| count(target.id) | distance        |
+------------------+-----------------+
|               50 | 9.9798002243042 |
+------------------+-----------------+
1 row in set (0.08 sec)

Not bad, but the query time will still grow fairly quickly as we add more rows.

There are rather complicated stored procedures one can use to improve the performance, but another option is to use the MySQL spatial types. These types are specifically designed for this type of query. Here is the migration function used.

public function safeUp()
{
    $this->addColumn('location', 'geoLoc', 'GEOMETRY NOT NULL');
    $this->execute("UPDATE location SET geoLoc = GEOMFROMTEXT(CONCAT('POINT(', latitude, ' ', longitude, ')'))");
    $this->execute("ALTER TABLE location ADD SPATIAL INDEX geoIndex(geoLoc)");
}

And here is another helper function I use:

CREATE FUNCTION geoBound(lat float, lon float, d float) RETURNS POLYGON DETERMINISTIC
BEGIN
    DECLARE minlat float;
    DECLARE maxlat float;
    DECLARE minlon float;
    DECLARE maxlon float;
    SET minlat = lat - d / 69;
    SET maxlat = lat + d / 69;
    SET minlon = lon - d / abs(cos(radians(lat)) * 69);
    SET maxlon = lon + d / abs(cos(radians(lat)) * 69);
    RETURN GeomFromText(CONCAT('POLYGON((', minlat , ' ', minlon, ',', minlat, ' ', maxlon, ',', maxlat, ' ', maxlon, ',', maxlat, ' ', minlon, ',', minlat, ' ', minlon, '))'));
END
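The arithmetic inside geoBound can be checked on its own. Here is a Python sketch of the same box computation, returning the four edges instead of a POLYGON; geo_bound and contains are illustrative names, not functions from the application.

```python
import math

MILES_PER_DEGREE_LAT = 69  # same approximation as the SQL function


def geo_bound(lat, lon, d):
    """Return (minlat, maxlat, minlon, maxlon) for a box reaching d miles each way."""
    dlat = d / MILES_PER_DEGREE_LAT
    # A degree of longitude shrinks by cos(latitude), so the box must widen in degrees.
    dlon = d / abs(math.cos(math.radians(lat)) * MILES_PER_DEGREE_LAT)
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)


def contains(bound, lat, lon):
    """True when the point lies inside the box, edges included."""
    minlat, maxlat, minlon, maxlon = bound
    return minlat <= lat <= maxlat and minlon <= lon <= maxlon


# Approximate coordinates for San Francisco, used only as an example.
box = geo_bound(37.77, -122.42, 10)
print(box)
```

The box is only a cheap pre-filter: its corners are farther than d miles from the center, which is why the queries still apply the exact distance check afterwards.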

Now the spatial index lets us make near-instantaneous queries.

mysql> select count(target.id), geoDist(target.latitude, target.longitude, origin.latitude, origin.longitude) as distance from location origin, location target where geoDist(origin.latitude, origin.longitude, target.latitude, target.longitude) < 10 and origin.city = 'San Francisco' and origin.state = 'CA' and intersects(target.geoLoc, geoBound(origin.latitude, origin.longitude, 10)) having distance < 10;
+------------------+------------------+
| count(target.id) | distance         |
+------------------+------------------+
|               50 | 8.69016647338867 |
+------------------+------------------+
1 row in set (0.00 sec)

Note the distance is just the first value in that column. It is not the greatest, least, mean or any other significant value. The others are discarded when the rows are grouped for the count.

Okay, anyway, that’s what I’m talking about. Very reasonable. Nice and fast. There’s only one problem. PHP’s PDOs do not have objects for the geometry type, so we can’t convert easily from PHP objects to MySQL spatial types. I considered adding extra objects, or trying to find a way to use GeomFromText(), but eventually I decided the easiest way would be if we could just use a stored procedure on inserts and updates that would update a geometry field to keep it in sync with the latitude and longitude fields.

Using stored procedures with Yii

I wrote two simple stored procedures, one for update and one for insert, and a few functions. I did not use stored procedures for searches, though the implementation should be similar. I wanted to use these with ActiveRecord to make the use of them as transparent as possible and to ensure I still get all the benefits of ActiveRecord such as validation and type casting.

Searching

Searching is easy in my case. We can simply create a named scope that queries the geoLoc field. It will take a distance and return only entries within $distance miles. It acts much like the search() function in the templates in that it uses the values of the current model in the search as the origin.

public function within($miles) {

    if (empty($this->longitude) || empty($this->latitude)) return $this;

    $this->getDbCriteria()->mergeWith(array(
        'select'    => array(
            '*',
            'geoDist(:lat, :long, latitude, longitude) AS distance',
        ),
        'condition' => 'intersects(geoLoc, geoBound(:lat, :long, :within_dist))',
        'having'    => 'distance < :within_dist',
        'params'    => array(
            ':lat'  => $this->latitude,
            ':long' => $this->longitude,
            ':within_dist' => $miles,
        ),
        'order'     => 'distance',
    ));

    return $this;
}

This could be used to find all the locations within 10 miles of San Francisco as follows:

Location::model()->findByAttributes(array('city' => 'San Francisco', 'state' => 'CA'))->within(10)->findAll();

Very nice. Very reusable. Given a location it becomes easy to find nearby locations.

Updates

The easiest is the update procedure.

CREATE PROCEDURE updateLocation(
    IN locId INT,
    IN addr VARCHAR(255),
    IN cty VARCHAR(255),
    IN st VARCHAR(255),
    IN cntry VARCHAR(255),
    IN zip VARCHAR(255),
    IN lat FLOAT,
    IN lon FLOAT
)
BEGIN
UPDATE location
SET
    address = addr,
    city = cty,
    state = st,
    country = cntry,
    postal_code = zip,
    latitude = lat,
    longitude = lon,
    geoLoc = GEOMFROMTEXT(CONCAT('POINT(', lat, ' ', lon, ')'))
WHERE
    id = locId;
END

We can override CActiveRecord.updateByPk() to handle updating. It is called from CActiveRecord.update(), which is called on every update.

public function updateByPk($pk,$attributes,$condition='',$params=array())
{
    Yii::trace(get_class($this).'.updateByPk()','system.db.ar.CActiveRecord');
    $builder=$this->getCommandBuilder();
    $table=$this->getTableSchema();

    $command = $builder->createSqlCommand('CALL updateLocation(
        :id,
        :address,
        :city,
        :state,
        :country,
        :postal_code,
        :latitude,
        :longitude
    )', array(
        ':id' => $pk,
        ':address' => $attributes['address'],
        ':city' => $attributes['city'],
        ':state' => $attributes['state'],
        ':country' => $attributes['country'],
        ':postal_code' => $attributes['postal_code'],
        ':latitude' => $attributes['latitude'],
        ':longitude' => $attributes['longitude'],
    ));

    return $command->execute();
}

Now whenever we update a location, the geoLoc field will stay in sync with whatever we set the longitude and latitude to. The update method is the simplest because information flows only one way, into the database. There is no need to get information back from the query.

Inserts

Inserts are a little trickier. It isn’t hard to build the method, but there is the tricky case of getting the id of the inserted record back into the record. This is necessary if you ever want to add a newly created model to another model via a HAS_MANY relationship.

$location = new Location;
...
$location->save();
$model->location_id = $location->primaryKey;

My first attempt was essentially a copy of CActiveRecord.insert(), but I found that when I used a stored procedure, the database always returned 0 for the last insert id. To get around this I had the procedure return the id of the inserted record, and used CDbCommand.queryScalar() instead of CDbCommand.execute().

The procedure then looks as follows:

CREATE PROCEDURE addLocation(
    IN addr VARCHAR(255),
    IN cty VARCHAR(255),
    IN st VARCHAR(255),
    IN cntry VARCHAR(255),
    IN zip VARCHAR(255),
    IN lat FLOAT,
    IN lon FLOAT
)
BEGIN
    INSERT INTO location(address, city, state, country, postal_code, latitude, longitude, geoLoc) 
    VALUES (addr, cty, st, cntry, zip, lat, lon, GeomFromText(concat('point(', lat, ' ', lon, ')')));
    SELECT last_insert_id();
END

The insert method is a little more complicated than updateByPk.

public function insert($attributes = null) {
    if(!$this->getIsNewRecord())
        throw new CDbException(Yii::t('yii','The active record cannot be inserted to database because it is not new.'));
    if($this->beforeSave())
    {
        Yii::trace(get_class($this).'.insert()','system.db.ar.CActiveRecord');

        $builder = $this->getCommandBuilder();
        $table = $this->getMetaData()->tableSchema;

        // Note, need to finish the migration for this procedure.
        $command = $builder->createSqlCommand('CALL addLocation(
            :address,
            :city,
            :state,
            :country,
            :postal_code,
            :latitude,
            :longitude
        );', array(
            ':address' => $this->address,
            ':city' => $this->city,
            ':state' => $this->state,
            ':country' => $this->country,
            ':postal_code' => $this->postal_code,
            ':latitude' => $this->latitude,
            ':longitude' => $this->longitude,
        ));

        if($id = $command->queryScalar())
        {
            $primaryKey=$table->primaryKey;

            $this->$primaryKey = $id;

            $this->setPrimaryKey($this->$primaryKey);
            $this->afterSave();
            $this->setIsNewRecord(false);
            $this->setScenario('update');
            return true;
        }
    }
    return false;
}

Conclusions and Remarks

My situation is not terribly complex, but it illustrates how easy it is to use stored procedures for insertion and updating in Yii. At another time I may delve into how to use stored procedures to retrieve data from the database.

Some will say that putting the stored procedures in the model is bad design, in that it exposes some specifics of the database implementation at a fairly high level. However, the use of functions helps us abstract this: any other database need only implement that interface, so a migration to another database won’t be too painful.

The procedures are working well for us, and I hope this helps you too.

Inconvenient Convenience

Sometimes making things convenient goes a little too far. I ran into it this morning when I was working on setting up a new site with Ruby on Rails. You may know that Rails is ridiculously convenient for rapid prototyping. It does 95% of the boilerplate work for you so you can focus on the creative work. You can basically just run a few command-line commands, write a few html files, run a database migration, and you have a working site. It’s basically magic.

Rails feels like magic. Magic is another term for abstraction, and even more it is a term for good abstraction. Something feels like magic when you can’t see what it’s doing underneath but it just works. Rails gives this feeling all the time (until you try to do anything complicated and you have to dive into the internals).

However, all abstractions are leaky when it comes to programming. Eventually you’ll see the costs of your heap allocations or your database will get out of sync and things will come crashing down. The more magical the abstraction, the more puzzling it is when it doesn’t work. The more the details are intentionally hidden from you, the less you know when you need to diagnose the problem. This is when being magical is actually harmful.

My Example

In my case, I was creating a model for a feedback form where users enter their email, some comments, and a preferred class. If you are experienced with Rails you probably already know what I did wrong. I didn’t see it until I tried to actually create a new Feedback entity. That’s when I got this error in my server logs:

NoMethodError (undefined method `columns_hash' for nil:NilClass):
  app/controllers/feedbacks_controller.rb:4:in `new'
  app/controllers/feedbacks_controller.rb:4:in `create'

Huh? What the heck is that supposed to mean? Let me go check that I defined everything correctly because that almost looks like the class is nil, but then how would Feedback resolve to call new? Let me see if I can get things working in the console.

irb(main):001:0> Feedback.new()
=> #<Feedback id: nil, email: nil, class: nil, comments: nil, date: nil, created_at: nil, updated_at: nil>

Alright, that seems to be working, let’s give it some parameters.

irb(main):002:0> f = Feedback.new({:email => 'email', :class => 'class', :comments => 'comments'})
NoMethodError: undefined method `columns_hash' for nil:NilClass

Bah! Okay, how about we go back to that original object and just see if we can add parameters to that, because columns_hash implies it’s looking at the database.

irb(main):006:0> f = Feedback.new
NoMethodError: undefined method `has_key?' for nil:NilClass

Okay, that’s weird. Now the function that worked before doesn’t work anymore!

At this point I resorted to Googling and after a few false positives I hit on the answer.

Does your table have a column titled “class”?

/facepalm

ActiveRecord was trying to overwrite the built-in class method of the object. Its attempts to make record access simple, obvious, and easy collided with full force against some Ruby built-ins. The terrible part is that none of this was obvious to the programmer. Nowhere did it warn, nowhere did it realize what a mess it was making, what a mess I made. I love Rails. I do. It is a great framework. But with great power comes great responsibility, and the realization that mastery always requires hard work and experience.

How I Learned to Stop Worrying and Love the Types

I have a confession to make. I love strong static types. There’s something about them that I just can’t resist. Some people prefer the more flexible dynamic typing. Some folks dabble in type coercion. Some even do it with duck typing.

What I see most of all is that most people don’t understand them. Sometimes they avoid them, sometimes they just try to get by. I want to show you how to use types to your advantage. I want to show you how to learn to love them.

Verbosity

First off, let me get something out of the way. I hate explicit typing outside of function definitions. I have never figured out why Java requires you to repeat the type of a variable twice if you declare it and create a new instance of it on the same line. Java code has a lot of statements like:

Foo bar = new Foo();
Widget w = WidgetFactory.makeWidget(1);

I know that I can tell what type bar is from the new Foo(). I am sure the compiler could as well. If I know makeWidget returns a Widget then w has to be a Widget whether I declare it as one or not.

A system that does infer the type is said to perform type inference. With type inference the compiler will assume it knows nothing about a variable until you do something with it, at which point it will try to guess what type it has based on how it is used. For example, if you assign a number to a variable, then it must be some kind of number. If you type bar = new Foo() then bar must be at least a Foo, but it might be something more specific. The compiler will only complain when the types are too loose to actually do anything with (rare), or when you have a conflict (common).

Working with a language that does type inference goes a long way towards building a love of types. Now on to the more general tips.

Level 0: Where every programmer starts

My first language was BASIC. Most of us learn the basics of programming with simple languages, using them procedurally. Your first C programs will just be a lot of statements in main, maybe a call to a function or two. Your first Perl, Python, Ruby or PHP program will be similar but won’t even have that call to main. At this point all you really care about is that numbers are numbers but aren’t strings. This is the most basic understanding of types you can have. It works fine for your first few programs, but it really won’t get you very far.

Level 1: Encapsulation and Polymorphism

People learn about type systems all wrong. I was introduced to types as most of us are, with an object oriented language like C++, Java, Python, Ruby, PHP or Perl. It seems like almost every language that became notably popular in the last two decades had to have some sort of object oriented features. These languages taught me types, and they taught me all wrong.

The first thing they teach you is how to make a class. How to capture a set of data in a class and associate some methods with it. This is a lot better than writing a 4000 line program in BASIC. Toss those functions in a class so they don’t all end up with 20-character prefixes to avoid naming collisions. You can suddenly reuse the variable name! You can also group together sets of data and pass it around as a unit. A User record can have a name, password, privilege level, and any other data you need. But even higher-level languages still see many objects that are no more advanced than structs in C. Of course, this is an improvement, but it doesn’t really go very far towards understanding types.

Then you learn about inheritance, and how a Dog “is an” Animal. This opens up all sorts of new possibilities. You can start separating interface from implementation. If you don’t care what kind of Animal it is, only that it can move, then you will take any implementation of the Animal and know that it has that method. It’s quite convenient. It allows for a lot of neat tricks, but ultimately it is just another way of pushing a few bytes around more effectively.

Level Infinity: Making Types Work For You

Really understanding types is learning to make your type system capture semantic meaning. I think this is best explained through an example. One that is immediately familiar to most programmers is string sanitization. If you have users entering data, they will eventually enter it wrong. It’s an infallible fact of users that if there is something no intelligent user of the program would do, some user will find it and do it.

Public and private functions have been around for quite some time. Even C has them, in what you expose through your header files. This allows you to hide your implementation, and your public interface can act as a contract between library code and application code. Types can do the same thing. You can enforce a contract and guarantee that your data will be what you expect. This decouples the two pieces of code, giving you more freedom and ensuring you won’t miss bugs from failure to convert data properly.

Example: String sanitization.

Very often we have to take in strings from users, and it is an inerrant law of users that they will find some way to mess up the most idiot-proof system. To get around this we have to change their input to something more sensible. We call this string sanitization because people who give bad input should be sterilized.

If we wanted to take some user input (which might just be part of the URL) and use it to get a page from a blog database, we could do this in PHP with the following code:

$result = mysql_query("SELECT * FROM pages WHERE page_id = " . $_REQUEST['id']);

The code makes a query string with the requested id and sends that to the database. This is nice and simple and works fine until some joker requests the page 1; DROP TABLE users --.

It is an example of an SQL injection vulnerability. Sure, a quick mysql_escape_string() will fix it, but eventually you will forget, and sometimes the source of the data and where it is used are five function calls apart. Figuring out if an unsanitized string is used in a database query is generally an undecidable problem.

Now, let’s stop and think for a moment. mysql_escape_string() takes a String argument and returns a String, but are these really the same thing? Is a MySQL-safe string the same as raw user input? Are there some user inputs that aren’t safe? Are there some MySQL-safe strings you wouldn’t expect a user to type? I would go as far as to say that these two strings are actually different types! One is a String<Raw> and the other is a String<SqlSafe>. These are also different from String<URLEncoded> and String<Base64>. Why do we use the same type for all of these? Strings are just lists of characters, but when you put a few characters together you get emergent meaning. Since they are all different, you may wonder whether you are ever accidentally using one when you should be using another.

This is where the type-system comes to the rescue. It is great at keeping track of all of these things, and it loves to tell you when you’re wrong. Let me give an example of how a type system can protect the program from ever having a vulnerability due to unsanitized input.

String<Raw> getParam(HTTPRequest request, String<Raw> key);
String<SqlSafe> sqlEscapeString(String<Raw> unsafeString);
List sqlQuery(String<SqlSafe> query);

When you get a parameter from the request object you get a plain old string. You’d like to pass that into your SQL query, but you cannot unless you first transform it with sqlEscapeString. If you ever try to pass a raw string, the compiler will helpfully point out your error. You will thank the compiler, insert the method, and code on knowing your program is safe from bobby'; DROP TABLE students --. An added bonus is that you cannot escape a string twice without explicitly converting it back to a raw string, which should raise a red flag outside of your database library.
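The three signatures above can be turned into something runnable. The sketch below uses Python, which only checks annotations when an external type checker is run, so it also checks at run time. The class names RawString and SqlSafeString, the dictionary standing in for the request, and the toy escaping rule are all assumptions made for illustration; real database code should rely on the driver's own escaping or on parameterized queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawString:
    """Text exactly as the user sent it."""
    value: str


@dataclass(frozen=True)
class SqlSafeString:
    """Text that has been through sql_escape_string."""
    value: str


def get_param(request: dict, key: str) -> RawString:
    """Everything that comes out of a request starts life as a RawString."""
    return RawString(request[key])


def sql_escape_string(unsafe: RawString) -> SqlSafeString:
    """Toy escaping for illustration only: doubles backslashes and single quotes."""
    if not isinstance(unsafe, RawString):
        raise TypeError("sql_escape_string expects a RawString")
    escaped = unsafe.value.replace("\\", "\\\\").replace("'", "''")
    return SqlSafeString(escaped)


def sql_query(query: SqlSafeString) -> str:
    """Stand-in for the database call: it only accepts SqlSafeString."""
    if not isinstance(query, SqlSafeString):
        raise TypeError("sql_query expects a SqlSafeString")
    return query.value


raw = get_param({"id": "bobby'; DROP TABLE students --"}, "id")
safe = sql_escape_string(raw)
print(sql_query(safe))

try:
    sql_query(raw)  # the forgotten escape is caught instead of reaching the database
except TypeError as err:
    print("rejected:", err)
```

In a statically typed language the last call would be rejected before the program ever ran. Here the same mistake surfaces as a TypeError, and passing an already escaped string back into sql_escape_string fails the same way, which covers the double-escaping case mentioned above.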

The type system can be used to catch when data that is represented the same has a different meaning. An unescaped string should never be passed to an SQL query. Unsanitized strings are fundamentally different from sanitized strings. A less precise type system trusts you to know the difference. I don’t know about you, but I don’t trust myself to write bug-free code. I’ll take all the help I can get.

A Case Study in Enterprise Soul Smashing

It’s amazing how small things in a work environment can make the difference between a job I look forward to on Sunday night, and a job that makes any time too soon to get up. That statement is actually incorrect. It is a few very big things that cause this difference, but they are subtle and abstract. They manifest in small ways that contribute to the destruction of motivation. It is the skillful and subtle demolition of motivation that turns a large enterprise-y company into a soul-crushing beast that takes in bright, enthusiastic developers and spits out 9-to-5 drones. Like frogs in boiling water, they usually don’t understand what is happening or why.

This is the story of how I find my motivation evaporating at a large enterprise software company.

A convenient way to break things out into categories is to use Daniel Pink’s elements of motivation. These three factors are mastery, autonomy, and purpose. For mechanical tasks monetary or equivalent rewards work well, but for creative or cognitive tasks, higher monetary rewards can actually reduce performance. To get high performance for these tasks, people need non-monetary rewards, and that is where these three factors come in. He has written extensively about this in Drive. To get a quick summary of it you can watch the TED talk (18:37) or a longer talk given to the RSA (41:22).

Purpose

Purpose is the feeling of inspiration that comes from writing code that you care about. It is doing something you can be proud of and that makes you feel good in a moral sense. For me, this is mostly fulfilled by writing quality code that solves interesting problems. The problems themselves are interesting to me, even if the application is unclear, limited, or doesn’t affect me. I want to write code that betters me and that other people can build on and learn from.

However, the code I have seen so far is not something I want my name on. In fact my eyes have been opened to how bad code can be and still pass for a professional and world-class product. It is hard to go home happy at the end of the day when I am embarrassed to tell my friends what I work on, and when I don’t want to show the application to anyone because of all the bugs and compromises I had to make. When I cannot give a good technical reason why things were done the way they were I find myself afraid of questions about our design choices.

Unfortunately, I find that a lot of the work I do is constrained by the lowest common denominator. The code has to be written so that it isn’t too different from the worst code so that the worst developers don’t get too confused and the poorly designed system doesn’t break. I spend most of my time trying to work out poor design decisions in the past and not being able to try interesting things and really solve the hard problems.

I firmly believe that if a programming task isn’t hard then it should be automated. Code can be written so you don’t have to repeat yourself, so that it is DRY. All code should be written such that future additions have less and less work required until all that is required is to specify only the items that are unique to this new feature. For example, a new button in the navigation bar of a web page should not require modifying every page to add it. It should not require copying the layout to a new page. It should not require copying all the permissions code. It should not require writing all the database query setup, connection, and tear-down all over again. It should only require you specify a new menu item, specify what happens when that button is pressed and when it is visible, and then write only the code that is absolutely unique to that page.

Getting to this point is an asymptotic process. You can’t get there instantly, but every time something is duplicated, it can be factored out. When a pattern gets reused, it should be abstracted. As the code progresses, it gets cleaner.

At this corporation I find that we have so much duplicate code that at least half of our time is spent copying code and modifying it slightly for the new feature, because that is how it is done in 100 other places. There isn’t time in the roadmap to change all 100 places the code is used, there aren’t enough QA resources to make sure those 100 other things still work, and there are 20 other developers who might get confused if the code is changed from the way they are familiar with.

The corporation becomes afraid of change and the lack of change causes the cost of change to increase each time refactoring is skipped in favor of the quick but dirty fix. These corporations accrue and accrue technical debt to the point that they paralyze themselves. And when you cannot create anything new and cannot even do a good job, you can’t create anything interesting and there isn’t much you can take pride in.

So why keep working if I’m just going to feel guilty about it later?

Mastery

Mastery is the desire to get better at what you’re doing. In me it’s a desire to learn new ways of doing things, try new things, and work with people who will force me to get better at what I’m doing. I want to feel that at the end of each project I have done something that I would not have known how to do before the project. An interesting problem on its own can motivate me to work harder than I will for money.

I can’t even tell you how many hours I’ve spent experimenting with Rubik’s Cubes and similar puzzles. I take it as a personal challenge to figure these things out, and now that I know some of the basic principles from my 3x3 cubes, I refuse to look up solutions to similar puzzles. I actually make the work harder because I will learn more.

I spend hours and hours programming for fun in my free time. Most of that time is spent trying out new open-source technologies, learning new programming languages, creating web-sites, and working on open-source projects. None of these things have direct monetary rewards for me, but I love getting better at programming and learning new things. The total hours I spend work out to at least a part-time job.

With our code-base so horrible and our inability to do anything new, I cannot explore anything new and I cannot find new ways to improve the code. Improvements are almost frowned upon. I start to worry that my coding will become worse because I will pick up the bad habits I have to employ because I’m not given time to fix them. All that motivation that drives me to work hard at home or in school is lost when I cannot work on something that I think will better me as a person.

I worry that many of the developers at this company are not particularly talented. For example, I can only think of a handful of engineers who I know to program outside of work hours. Most of them are not proficient in any language other than Java, except maybe JavaScript. When I mention languages like Ruby, Python, Scala, Clojure, or Haskell, most have never even heard of them, let alone been able to have a serious conversation about the trade-offs between different languages. Many of them don’t seem to be particularly well trained in terms of algorithms and theory. I asked one engineer what their favorite sorting algorithm was and they responded, with seriousness, that it was Bubble Sort. When asked why, they responded that it was easy to implement. These are not traits of people who are looking to improve. They do not question the way things are and seek ways to make things easier in the future. This prevents me from being able to collaborate on a high level. When I suggested an improvement to one of our large applications, another engineer told me that we should instead try to keep things consistent with the other groups, so I should drop it.

I can’t converse with these people to learn. I can’t have a debate where we both learn something new. I can’t have them point out a way in which things could be done better because most of them aren’t interested in improving themselves or the code base. Smart people want to be surrounded by other smart people because they want to be understood and they want to have to work to understand what their peers are saying.

How do we end up with all these people? There are two factors. The first is that quality people are not hired, because the primary goal is to meet deadlines. That means hiring people who know a few specific technologies and can start writing code towards the deadline quickly, instead of hiring people who will take longer to get up to speed but contribute higher quality code in the long run. The second is that people who are happy in a job tend not to leave, so those who want to do things differently will have left and found another job, and those who remain are happy with the system and not terribly worried about changing it.

Being unable to converse critically with my peers and unable to try interesting things within our work really doesn’t allow me to learn anything new at work. Instead I find myself writing boilerplate code to work around previous design mistakes. I don’t feel like the organization is really interested in making a large effort to get things fixed. Instead of stopping to fix the problems we have, we have promises that things will be fixed at some indeterminate time in the future. For example, when I joined it was already a joke that things would be fixed in the Big Rewrite and a joke that the Big Rewrite would be coming “soon”. This gives the impression that lip service is paid to the idea of improving, but the actual effort is deferred indefinitely.

So why keep working when you’re just going to have to redo it later because you couldn’t do it right the first time?

Autonomy

Autonomy is the feeling that the decisions I make matter, that my decisions are respected, and that I’m given the power to make decisions.

This ties in closely to the mastery comment. I feel like this company does not really value original contributions. People are not very interested in trying new ways of doing things, and are more interested in keeping things as homogeneous and conceptually simple as possible even if that means more complexity, bulk, and bugs in the code. The structure of the code makes it hard to try anything interesting because the cost of change is too high.

This problem also comes up in planning and scheduling. I have the feeling that when suggestions are made for things that need to be done, often they are overruled by the desires of the project managers and the fully-booked road map has no room for what the developers want because management has already filled it.

For example, when I enthusiastically complain that portions of the code need to be rewritten sooner rather than later, I am told to wait an indefinite amount of time for the rewriting project to be placed in the roadmap when I would rather take the time to do it right now. This leads to many poor work-around jobs that accrue technical debt. I found a portion of the code that is fairly small and very poorly written. It hasn’t been touched for about five years and sorely needs an upgrade because the infrastructure can barely support the modifications we are making. I advocate taking a time out on the new features to refactor and fix the old code, but that isn’t in management’s timeline, so I will have to wait for them to schedule it in, and since the roadmap for the year has already been planned out, maybe we can get it in a year or two. In the meantime, I am told, please work on the projects management has deemed important.

I would rather get things written right than have to fix bugs for the next six months. However, we don’t leave time in the sprints to schedule these things in, and when we come down to the wire (which we seem to do fairly often) the rewrites are the first things pushed out, because management doesn’t see them as being as necessary as the developers do. I see things that I want to improve. I want to dive in and bury myself in improving these things, but there are other routine tasks that management wants done, so those take priority. It gives the impression that my time is not my own and that all my time at work is to be filled by management with management-approved tasks. It is coding as a machine, not as art.

Writing good code also means having the correct tools available. In a large enterprise where the cost of change has increased to the point that people are afraid of it, IT and security groups can be the worst of all. Any IT or security group would rather that everyone just stayed home, because that would make their job a lot easier. They have to make some concessions so that work can actually get done, but the less ground they give, the less work they have to do. This creates a tension between development and support groups, and unless management constantly fights for the rights of their developers, we will be stuck using only the tools and technologies that support will let us have, further reducing my ability to do the best job I can.

So why keep working when I can’t work on the things I find interesting?

Conclusion

All of this can be traced back to a company that prefers stability over quality. The most important thing is not to disrupt the current state of affairs lest the new be worse. The company will avoid potential improvements for fear of potential losses, and has been slowly backing itself into a corner so that the only way forward is through long, slow, hard work.

I want to do good work. I don’t care about the money or the benefits. I just want to create a good product, create good code, and improve myself and maybe the discipline while I’m at it. If nothing else I would like to improve the average code quality in the world. Give me something to work on and I will do that, but first the company has to get out of the way. If I have the freedom and support to do great things, I will do everything I can to do them. If the company is afraid of that and spends its time micromanaging and imposing limits on what I can do, then I’d rather be a homeless hacker than an enterprise hack.

There really seem to be three choices.

  1. I can live with it, and let my soul be ground into dust. I’d make good money, but I would have to satisfy my intellectual curiosity outside of work.
  2. I can try to change the system and make a difference at this company. This would involve a lot of long, hard work that may or may not pay out.
  3. I can go work somewhere else that is more in line with my personality.

Life is short. I don’t want to waste my time with a company that won’t change and won’t help me better myself. I’m starting to look at option #3 very seriously.

Oh, and to top it all off, they say they are Agile.

My Journey to Programming Enlightenment: How do I get back?

Learning to program is different for everyone. I picked up BASIC when I was about 10. By "picked up", I mean I had an old 286 with BASIC and DOS and I thought I'd try to write some games. I think it was a few months before I discovered sub-routines instead of just using GOTO. It was a few more years before I was properly introduced to object-oriented programming, but it wasn't until college that I really started to unlearn my habits from BASIC when I picked up Perl, C, JavaScript, and Lua. At my first significant job I learned Ruby and fell in love. I considered myself a reasonably competent programmer. I could hack things together that worked, even if they weren't pretty. None of this prepared me even remotely for what I would learn next.


Book Review: Solaris by Stanisław Lem

Solaris was originally published in Polish in 1961. It was later
translated into English and it was the 1970 translation by F&F Walker
and Co. that I read. Despite being a foreign book it reads as well as
any natively written book and there is nothing that breaks the
immersion by thrusting the circumstances of the times on the reader.

The story is a stunning exploration of the human psyche. In the
distant future men have discovered a planet covered with a mysterious
substance that may be alive, but with which all attempts at
communication have failed. Kris Kelvin is a psychologist who has been
studying the planet and is now making his first visit to the station
in orbit. While there, he finds that the planet is communicating with
each of the station's inhabitants by creating constructs that exactly
mimic the person's memories of someone they have a strong emotional
attachment to.

Kelvin is visited by his dead wife, who committed suicide 10 years
before. He knows she is not real and attempts to kill her only to
have her reappear the next morning with no memory of what occurred.
He is unable to keep the secret and over the following weeks he must
come to deal with her resurrection, his attempt at her life, another
attempt at suicide, and her realization of what she is.

The tone of the book is reminiscent of H. P. Lovecraft. It is told
in the first person and will digress for pages about the mysterious
behavior and formations on the planet. It has the same masterful
suspense that draws you through a slow plot with amazing swiftness.

Solaris is another piece of classic science fiction. If you are a fan
of H. P. Lovecraft and other horror or suspense novels, then I highly
recommend it, but I know it may be a bit too tense for some readers.
This is definitely not the sort of book you would want to give to a
young adult.

Wikipedia: http://en.wikipedia.org/wiki/Solaris_(novel)