2011-07-20

How to trim NonBreakingSpace (&nbsp;) when parsing HTML

The 'non breaking space' character will not be removed by the trim function in Java or Scala. In Java you can use a regexp (see this blog post) to do it, but Scala has a better way.

In Scala you can use higher-order functions to work with Strings: a String can be treated as a sequence of Chars, so you can use the filterNot function.

scala> val ns = <div>1 2\u00A0 3&#160;</div>
ns: scala.xml.Elem = <div>1 2? 3?</div>

scala> ns.text.filterNot(_ == '\u00A0')
res1: String = 1 2 3
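
If this is something you do often it is easy to wrap up in a small helper. A possible sketch (the object and method names here are my own, not from the original post) that either strips all non-breaking spaces or treats them as trimmable whitespace:
// Sketch: helpers for dealing with the non-breaking space character.
object Nbsp {
  val nbsp = '\u00A0'

  // Remove every non-breaking space from the string.
  def stripAll(s: String): String = s.filterNot(_ == nbsp)

  // A trim that also treats the non-breaking space as trimmable whitespace.
  def trim(s: String): String =
    s.dropWhile(c => c.isWhitespace || c == nbsp)
     .reverse
     .dropWhile(c => c.isWhitespace || c == nbsp)
     .reverse
}

// Nbsp.stripAll("1 2\u00A0 3\u00A0")  // "1 2 3"
// Nbsp.trim("\u00A0 hello \u00A0")    // "hello"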

2011-04-01

Frameworks - Do or Think

I recently gave a live coding session, where I demonstrated functional concepts and how powerful they are for keeping concerns separated.

The exercise consisted of how to separate what you do with a resource, from how you handle the lifecycle of that resource.

The problem can look something like this:
import java.io.{File, PrintWriter}
import java.util.Date

def writeToFile(file: File) {
  val writer = new PrintWriter(file)
  try {
    writer.println("Hello Kitty")
    writer.println(new Date)
  } finally {
    writer.close
  }
}
This function is completely locked down. It isn't even easily testable: "now" is written to the file, but we have no way to influence what is written, and we can't assert on "now" since it is an ever-moving target.

The solution is simple: keep the handling of the PrintWriter's lifecycle inside the function, and pass in what to write as an argument to the function.

The function now looks like this:
def writeTo(file: File)(op: PrintWriter => Unit) {
  val writer = new PrintWriter(file)
  try {
    op(writer)
  } finally {
    writer.close
  }
}

And calling it looks like this:
val now = new Date

writeTo(file)(
  writer => {
    writer.println("Hello Kitty")
    writer.println(now)
  })

Here we can test "now", since it is declared outside the function call. The function turns out to be fully reusable: we can write whatever we want, however we want, to the file.

Furthermore, when writing the tests for this function, the exact same pattern popped up two more times: separating the lifecycle of a resource from the use of that resource.
Once when validating the written content of the file, and once when handling the test file itself, roughly along the lines of the sketch below.
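
The helpers on the test side could look something like this (a sketch with names I picked for illustration, not the actual test code from the session):
import java.io.File
import scala.io.Source

// The lifecycle of the temporary test file is handled here...
def withTempFile(op: File => Unit) {
  val file = File.createTempFile("writeTo", ".txt")
  try {
    op(file)
  } finally {
    file.delete
  }
}

// ...and so is the lifecycle of the Source used to validate the content.
def contentOf(file: File): String = {
  val source = Source.fromFile(file)
  try {
    source.mkString
  } finally {
    source.close
  }
}

// Usage in a test:
// withTempFile { file =>
//   writeTo(file)(_.println("Hello Kitty"))
//   assert(contentOf(file).trim == "Hello Kitty")
// }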

So, separating concerns turns out to be very easy in Scala.
-------------------------------------

Reflecting on this I came to a realization:
When using a language that really helps you separate concerns, you are much less dependent on an external framework making things pretty for you.
-> An API that "does stuff" can be made much more focused, since it's so easy to make the use of it pretty.

But there is another side to this.
When writing my tests I use Specs, because I like to formulate my thinking in a BDD way while testing.
-> There is another way to write frameworks, the DSL way, whose main purpose is to let you formulate your thoughts.

When writing an API it is tempting to always go the "Think" way because it appears more profound, but this can make things more complicated than they have to be.

A Think framework is usually harder to pick up, since picking it up means learning the problem domain described by the DSL.
Picking up a Do framework mostly consists of finding out which function to call.

When using a truly full-featured language like Scala it is important to realize that you don't have to provide for everything.
-> A tight, slick "Do" API can be very valuable, and it is easier to write.

2011-02-19

Parallel collections in Scala

The other day I got inspired to check out the new parallel collection feature of Scala 2.9.
Since there isn’t a release yet I braced myself for an interesting time. However, I quickly found it much easier than expected.

Download and unpack the latest build of Scala 2.9 from scala-lang: http://www.scala-lang.org/node/212/distributions. Since I didn’t want to mess with my current environment I decided to do the tests in the REPL. To start a new Scala console, just go into the newly unpacked distribution and run bin/scala there.


Then for something to do:
(1 to 1000000).toArray.map(_*2)
(1 to 1000000).toArray.par.map(_*2)
You turn on the parallel feature by slapping on a .par (or .toParIterator/.toParSeq … depending on collection type). Use an Array in preference to a List, since an indexed collection is better suited for parallelization. Of course, it is up to you to guarantee that what is done to the elements of the collection really can be done safely in parallel, but the mechanics of it really are this simple!
However, when I ran it, the parallel version turned out to take more time than the normal one?!


I tried again:
case class Call(i: Int) { 
  def call { Thread.sleep(100) } 
}

val arr = (1 to 100).toArray.map(new Call(_))
arr.map(_.call)

val parr = arr.par
parr.map(_.call)
Trust restored, twice the speed for the parallel version.
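
A crude way to time the comparison in the REPL is a helper like this (my own sketch, reusing arr and parr from above; the original gist may do it differently):
// Sketch: wall-clock timing of a block, good enough for a rough REPL comparison.
def time[A](label: String)(block: => A): A = {
  val start = System.currentTimeMillis
  val result = block
  println(label + ": " + (System.currentTimeMillis - start) + " ms")
  result
}

time("sequential") { arr.map(_.call) }
time("parallel")   { parr.map(_.call) }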


So what was the problem, why isn’t the parallel version always faster? The parallel collections implementation in Scala uses the fork/join thread pool from Doug Lea’s concurrency enhancements to Java, and that isn’t introduced until Java 7. If the fork/join thread pool isn’t available, Scala falls back to a normal thread pool, and if you just throw processor-intensive work at plain threads, the result tends to be “not so good”.
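
If you are unsure which JVM the REPL has actually picked up, the standard system properties will tell you (plain Java API, so this sketch should work as-is):
// Check which JVM and Java version the REPL is running on.
println(System.getProperty("java.vm.name"))
println(System.getProperty("java.version"))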


So, the next step was to make use of all cores.
After some searching I found the openjdk-osx-build project on Google Code, which continuously builds binary releases of Java 7 for OS X, and I downloaded and installed the latest build. Then it was only a matter of changing the order in “Java Preferences” so that Java 7 is selected first, and starting a new REPL console.


The Scala REPL now nicely reports Scala 2.9 and Java 7, and both of the earlier examples are faster in parallel mode: the “memory intensive” example almost twice as fast and the “thread sleep” example up to 10 times as fast!


So now it is time to buckle up, start writing immutable objects and reap the benefits!


Runnable code at: 
https://gist.github.com/829556