Posts under Technobable

How to configure Geb/Spock with Gradle

Geb/Spock + Gradle

Well, it turns out you have to use the right version of geb-core, geb-spock and spock-core, not to mention the right version of Groovy. The problem appears to be that the Geb/Spock integration jar (geb-spock:0.7.2) was built against Groovy 1.8 and hasn't caught up to the Groovy 2.x series yet. This means trying to get Geb/Spock working on Gradle 2.x (which bundles Groovy 2.x) just won't work – you will get ClassNotFoundExceptions and an urge to pull your hair out. After digging around and trying various combinations I finally settled on Spock 0.6 and Gradle 1.8, with Geb/Spock 0.7.2 and Geb 0.7.2. Note that Geb and the Geb/Spock integration should be on the same version. My gradle dependencies wound up looking like this:

dependencies {

	def seleniumVersion = "2.42.2"
	def phantomJsVersion = '1.1.0'
	def cargoVersion = '1.4.9'

	// selenium drivers
	compile "org.seleniumhq.selenium:selenium-ie-driver:$seleniumVersion"
	compile "org.seleniumhq.selenium:selenium-chrome-driver:$seleniumVersion"
	compile "org.seleniumhq.selenium:selenium-firefox-driver:$seleniumVersion"
	compile "org.seleniumhq.selenium:selenium-support:$seleniumVersion"
	compile("com.github.detro.ghostdriver:phantomjsdriver:$phantomJsVersion") {
		transitive = false
	}

	// geb
	compile 'org.codehaus.geb:geb-core:0.7.2'
	compile 'org.codehaus.geb:geb-spock:0.7.2'

	// spock
	compile 'org.spockframework:spock-core:0.6-groovy-1.8'

	compile 'junit:junit:4.8.2'
	compile 'org.slf4j:slf4j-log4j12:1.7.6@jar'
	compile 'org.slf4j:slf4j-api:1.7.6@jar'

}
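
For reference, the specs themselves are plain Geb/Spock classes. Here's a minimal sketch of what one might look like – the package, class name, URL and assertion are hypothetical, but the geb.spock.GebSpec base class comes from the geb-spock dependency above:

package com.something.acceptance

import geb.spock.GebSpec

// minimal, hypothetical acceptance spec using Geb's Spock integration
class HomePageSpec extends GebSpec {

	def "home page responds"() {
		when: "we browse to the app started by cargo"
		go "http://localhost:8080/"

		then: "the page has a title"
		title
	}
}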

I wanted to create a separate task just to run these Geb/Spock tests so did the following:

task acceptanceTest(type: Test, dependsOn: [compileTestGroovy]) {

	maxParallelForks = 5
	forkEvery = 5

	include 'com/something/acceptance/**'

	doFirst {
		println '\nStarting tomcat via cargo'
		tasks.cargoStartLocal.execute()
	}

	doLast {
		println '\nStopping tomcat via cargo'
		tasks.cargoStopLocal.execute()
	}

	def timestamp

	beforeTest { descriptor ->
		timestamp = new Date()
	}

	afterTest { desc, result ->
		logger.lifecycle("\n\n>>> Running " + "${desc.name} [${desc.className}]")
		println "Executed ${desc.name} [${desc.className}] with result: " +
			"${result.resultType} in ${new Date().getTime() - timestamp.getTime()}ms"
	}

}

Since my Geb tests are written in Groovy, I've structured my project so that my acceptance tests live in the proper groovy source directory, and I can now run Geb tests just like regular unit and integration tests. Heck, I could even bundle cargo with it and have it run my application, fire up the Geb/Spock tests and then shut down the app in one fell swoop. The final script looks like this:

buildscript {
	repositories {
		jcenter()
	}
	dependencies {
		classpath 'com.bmuschko:gradle-cargo-plugin:2.0.3'
	}
}

apply plugin: 'java'
apply plugin: 'groovy'
apply plugin: 'com.bmuschko.cargo'

repositories {
	jcenter()
	mavenCentral()
}

dependencies {

	def seleniumVersion = "2.42.2"
	def phantomJsVersion = '1.1.0'
	def cargoVersion = '1.4.9'

	// selenium drivers
	compile "org.seleniumhq.selenium:selenium-ie-driver:$seleniumVersion"
	compile "org.seleniumhq.selenium:selenium-chrome-driver:$seleniumVersion"
	compile "org.seleniumhq.selenium:selenium-firefox-driver:$seleniumVersion"
	compile "org.seleniumhq.selenium:selenium-support:$seleniumVersion"
	compile("com.github.detro.ghostdriver:phantomjsdriver:$phantomJsVersion") {
		transitive = false
	}

	// geb
	compile 'org.codehaus.geb:geb-core:0.7.2'
	compile 'org.codehaus.geb:geb-spock:0.7.2'

	// spock
	compile 'org.spockframework:spock-core:0.6-groovy-1.8'

	// cargo support
	cargo "org.codehaus.cargo:cargo-core-uberjar:$cargoVersion",
		"org.codehaus.cargo:cargo-ant:$cargoVersion"

	compile 'junit:junit:4.8.2'
	compile 'org.slf4j:slf4j-log4j12:1.7.6@jar'
	compile 'org.slf4j:slf4j-api:1.7.6@jar'

}

// == test configurations == //

task acceptanceTest(type: Test, dependsOn: [compileTestGroovy]) {

	maxParallelForks = 5
	forkEvery = 5

	include 'com/something/acceptance/**'

	doFirst {
		println '\nStarting tomcat via cargo'
		tasks.cargoStartLocal.execute()
	}

	doLast {
		println '\nStopping tomcat via cargo'
		tasks.cargoStopLocal.execute()
	}

	def timestamp

	beforeTest { descriptor ->
		timestamp = new Date()
	}

	afterTest { desc, result ->
		logger.lifecycle("\n\n>>> Running " + "${desc.name} [${desc.className}]")
		println "Executed ${desc.name} [${desc.className}] with result: " +
			"${result.resultType} in ${new Date().getTime() - timestamp.getTime()}ms"
	}

}

// == cargo configuration == //

cargo {
	containerId = 'tomcat7x'
	port = 8080

	deployable {
		file = file("target/path/to/application.war")
		context = "/"
	}

	local {
		installer {
			installUrl = 'http://archive.apache.org/dist/tomcat/tomcat-7/v7.0.54/bin/apache-tomcat-7.0.54.zip'
			downloadDir = file("tomcat/download")
			extractDir = file("tomcat/extract")
		}
	}
}

As you can see, before the Geb tests run, I invoke the cargoStartLocal task to fire up tomcat7, and I’ve configured cargo such that it will download tomcat7 from apache, extract the archive, and then deploy my war file on port 8080. Once the Geb tests complete, cargo will shut down the app, and my automated acceptance tests will be complete.
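
With all of that in place, the whole acceptance run (start tomcat, run the Geb/Spock specs, stop tomcat) kicks off with a single command, assuming the task name used above:

gradle acceptanceTest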

Happy testing!

Subverting foreign key constraints in postgres… or mysql

Temporarily disable key constraints?

On postgres (version 8.1, mind you) I ran across a scenario where I had to update a set of records that carried foreign key constraints with other tables. I was tasked with updating this table, and the new data could end up in a state with broken key constraints. The normal postgres replace function would not work, as there was no regex replace I could run that would affect all the entries the way I wanted without breaking FK constraints. Ultimately I had to break down my queries in such a way that at the end of the transaction, the constraints would check out. It turns out that in postgres, when you define a foreign key, you can flag it as DEFERRABLE:

ALTER TABLE tb_other ADD CONSTRAINT tb_other_to_table_fkey
	FOREIGN KEY (tb_table_pk) REFERENCES tb_table (tb_table_pk) MATCH SIMPLE
	ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY IMMEDIATE;

With the alter table command above we can then make use of this DEFERRABLE clause – it tells postgres that this constraint check may be deferred until the end of the transaction. The INITIALLY IMMEDIATE clause tells postgres that the default behavior is to check the constraint immediately, when the transaction attempts to perform the corresponding delete or insert. You can also flag the constraint as INITIALLY DEFERRED, which, as you might guess, tells postgres to check the constraint at the end of the transaction. Generally, if you want constraints, you will probably want them checked immediately – but it's good to know you have the option if you really need it.

Once the foreign key constraint is set as deferrable, we can then execute a script like this to defer the constraint checks until the end of the transaction:

-- postgres deferred constraints in action
begin;

SET CONSTRAINTS ALL DEFERRED;

delete from tb_table;

insert into tb_table values ( nextval('sq_table'), value1, value2, value3);
insert into tb_table values ( nextval('sq_table'), value1, value2, value3);
insert into tb_table values ( nextval('sq_table'), value1, value2, value3);

commit;
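
If you'd rather not defer every constraint in the transaction, you can also name just the one you care about – a small sketch using the constraint created above:

-- defer only the named constraint for this transaction
begin;

SET CONSTRAINTS tb_other_to_table_fkey DEFERRED;

-- statements that temporarily violate the constraint go here

commit;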

Pretty useful in my opinion. I prefer this solution to disabling triggers across the table, since disabling triggers is a schema change and you end up being responsible for restoring them once you're done. Consider the following:

-- postgres disabled triggers
begin;

ALTER TABLE tb_table DISABLE TRIGGER ALL;

delete from tb_table;

insert into tb_table values ( nextval('sq_table'), value1, value2, value3);
insert into tb_table values ( nextval('sq_table'), value1, value2, value3);
insert into tb_table values ( nextval('sq_table'), value1, value2, value3);

-- make sure to restore the triggers
ALTER TABLE tb_table ENABLE TRIGGER ALL;

commit;

In this implementation you end up altering the schema to disable all the triggers associated with the table. Don't forget to re-enable the triggers at the end of the transaction, or the disabling will remain in place. Another thing to consider: if you have audit-type triggers on your target table, you will end up having to manually fire those triggers or run the appropriate statements to preserve the original triggers' integrity. This kind of thing can quickly turn into quite the problem if not handled correctly.

Mysql’d keys

Mysql handles this case much like the disabled-triggers approach – instead, it uses a system variable called FOREIGN_KEY_CHECKS that can be toggled on or off:

-- mysql key constraint suppression
begin;

-- lift the foreign key checks
SET FOREIGN_KEY_CHECKS=0;

delete from tb_table;

insert into tb_table values ( nextval(sq_table), value1, value2, value3);
insert into tb_table values ( nextval(sq_table), value1, value2, value3);
insert into tb_table values ( nextval(sq_table), value1, value2, value3);

-- put back when you're done
SET FOREIGN_KEY_CHECKS=1;

commit;

As you can see it's a very similar approach to the trigger disabling in postgres. From the documentation at the time of this writing (mysql version 5.5 – Deferred Foreign Keys in MySql) it looks like deferred keys are simply not an option in mysql, even though they're listed as part of the SQL standard. Worth noting.

References:
Postgres Set Constraints
Postgres Create Table documentation

Run a huge query as fast and safely as possible

Use this as a last resort

Queries that take a long time are generally a bad thing. If your application requires these kinds of measures to perform its duties, then chances are you really need to revise your table structures and/or your queries – ideally these queries should take seconds at most, while data-warehouse-type reporting queries should be on the order of minutes. That said, sometimes you may need to update your entire schema, delete columns on a table with millions of records, or run a stored proc that goes in and cleans up data across several sets of tables and untold numbers of rows. If you run it from putty or any other remote terminal and anything severs your connection, you might end up SOL with a rolled-back transaction that leaves you exactly where you started – with no data updated. Here are some strategies you can use to mitigate the risk and cut down on the query time.

Try different strategies

Consider running a proc that pulls one million records and then updates each record individually – you might want to get some popcorn, since that update might take a while. That kind of update is a linear approach and generally bad because it needs to go through each record sequentially, one at a time. Divide and conquer might work better – you could try batch updates across segments of the table where indexes are used, something like:

update table set column = value where constraint = 'arbitrary value';
update table set column = otherValue where constraint = 'some other value';

Another approach could be to reconstruct the table using the data from your target table, while filtering out or substituting in hardcoded values for the data you want to replace:

insert into clone_table
select primary_key, column, now() as activated_date, 
	other_column, true as is_active
from table 
where status = 'active';

You could use this approach to reconstruct your table with the data you want and then swap table references on the foreign keys. That part might get a little tricky, but if you do it right, an insert ... select could end up saving you quite a bit of time – insert selects can take minutes, while row-by-row updates can take orders of magnitude longer.
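
The swap itself can be done with a couple of renames once the clone is populated – a rough sketch (the table names are the same placeholders used above, and any indexes, constraints and foreign keys still have to be recreated against the new table):

-- swap the rebuilt table into place
begin;

alter table table rename to table_old;
alter table clone_table rename to table;

-- recreate indexes and foreign keys against the new table before relying on it

commit;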

Use screen to wrap your remote session

If your database is running on unix or linux, without a doubt you'll want to use screen if you need to run a very long query. If your database is on Windows, I'm not sure there's an equivalent. Anyone that's used putty or some other terminal type of remote console app knows what it's like to have some long running process terminate prematurely because the connection was severed, or your computer crashed. Screen saves you from those infrequent occurrences by creating an emulated session that can be detached and re-attached, so if you do get disconnected you can go back and pick up where you left off. It's very handy for executing a long running process where a disconnect would otherwise cancel the proc and terminate the session.

To invoke screen, just type the word screen into the command prompt:

[root@bedrock ~]# screen

This will start your screen session. This may or may not provide some relevant information at the bottom of the screen like in the example below depending on your flavor of unix or configuration:

[root@bedrock ~]#

[ bedrock ][ (0*bash) ][2011-09-09 21:57 ]

Now that screen is up, you can disconnect your terminal app without fear that your screen session would terminate prematurely. You can then log back into the unix box and get a listing of all the current screen sessions with the following command:

[root@bedrock ~]# screen -ls
There are screens on:
     27470.pts-0.bedrock (Attached)
     8177.pts-0.bedrock (Detached)
     mySessionName (Detached)
3 Sockets in /var/run/screen/S-agonzalez.

I should point out that the session name is organized like [processId.sessionName]. You can name your session upon creation with the following command:

[root@bedrock ~]# screen -S yourSessionName

Once you’ve found the right screen session (they’re listed by session name) you can re-attach your severed session with the following command:

[root@bedrock ~]# screen -r mySessionName
There are screens on:
27470.pts-0.bedrock (Attached)
8177.pts-0.bedrock (Detached)
2 Sockets in /var/run/screen/S-agonzalez.

Once you're in screen it's useful to know a few keyboard commands to get around (the default command prefix is Control+a):

Control+a, then d – detach your session without terminating it
Control+a, then h – write a screen capture to the working directory as hardcopy.x (x being the number)
Control+a, then C (capital c) – clear the screen of text
Control+a, then N (capital n) – display information about the current screen window
Control+a, then ? – help screen!

You can find more commands, options and details on the screen manpage.

Run your query through a local pipe

If your query pulls back a lot of data, it's going to require bandwidth to pipe it all back to your remote client. Don't use remote clients (like pgAdmin, MySQL Workbench, SQuirreL, etc.) unless you're running them directly on the box that's running your database. Instead, connect to the box remotely and log in through a local pipe using the database's own command line client:

[root@bedrock ~]# psql -U username my_pg_database
Welcome to psql 8.1.21, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

ml_publisher=#

You would be amazed how much faster a query runs when you’re running it directly on the machine. To give you an idea – running an update across 2 million rows might take an hour if you’re running from a remote client, while running it directly on the box might take mere minutes. We’re talking orders of magnitude of performance – for most development remote is perfectly fine, but for heavy lifting you can’t beat a local pipe.

Now you can run your query… If you’re running on a local pipe and you’re running on screen, you should be able to sever your screen connection without terminating your super long query. Let’s hope that query doesn’t table lock everything to kingdom come!

Configuring Data Sources, JBoss 7

Yep it’s gonna be a big year for JBoss AS 7

This will be the first in a series I'll be writing on JBoss' new application server, version 7. Lately I've been playing around with JBoss AS 7, and all I can say is.. !@#%, NICE! I downloaded 7.0 with the expectation that it would honor a lot of the previous version's overall approach and layout. I was in for a BIG surprise. It comes off as a total rewrite, leveraging a lot of the latest and greatest technologies and frameworks – things like Weld (an implementation of the context and dependency injection spec – JSR-299), OSGi (the Open Services Gateway initiative framework, for the uninitiated), Hibernate, and RESTeasy.

I'll say the guys over at JBoss certainly delivered. Before, server start up times could take a respectable 30 seconds to a minute or more depending on your deployment structure and dependencies. Now? Less time than my 15 second ant build script! Right now I'm clocking 14 seconds from cold to deployed on my smaller sized application. With AS 5, the same deployment was taking something like a minute. Hats off guys, you all at JBoss really did some work!

The first thing, and arguably the most difficult thing, you'll want to do is set up the data source for your deployment.

Configuring the Data Source

Before, we had to configure a postgres-ds.xml file with all the data source metadata required to configure our application. The process now isn't as straightforward – there are three ways to do it, two if you don't count using the really nice console manager it ships with. I should mention that there are now two types of configuration setups: 1) domain and 2) standalone. Standalone is the model we're most familiar with – a single instance acting as a single server. Domain, on the other hand, is geared for a clustered style of deployment – although it's way more flexible than that. More on this in another article. For the sake of simplicity, let's start with the standalone type.

Place the jdbc driver

There are two ways to do this. The first is really straightforward – just stick your jdbc jar file in the deployment folder indicated in the configuration file:

jboss-7.0.0.GA/standalone/configuration/standalone.xml

Relevant contents:

<subsystem xmlns="urn:jboss:domain:deployment-scanner:1.0">

	<deployment-scanner name="default" 
		scan-enabled="true" scan-interval="5000" 
		deployment-timeout="60"
		relative-to="jboss.server.base.dir" 
		path="deployments" />

</subsystem>

Stick your jdbc jar file in here, and JBoss will automatically configure your standalone.xml file for you. BTW, this deployment-scanner entry maps the location of the deployments directory:

jboss-7.0.0.GA/standalone/deployments

Where jboss.server.base.dir points to the “standalone” directory and path maps the name of the deploy folder “deployments”.

The second way is more complex and requires a little more legwork. JBoss has completely changed its class loading strategy, and if you've ever worked with Maven repositories it might feel very familiar. Essentially, jboss' modules folder is where all the jars used by the jboss server live. By keeping them on a separate classpath, you won't run into weird classpath errors when there are competing jar files/versions deployed by your application. This problem exposed itself in earlier versions of jboss – in particular with the xml jars. If you had a mixed case of xml libraries, jboss might have been using an older version that could override your application's newer version – hard to track down if you don't know where to look. Anyway, these jar files are organized by pseudo-packages – just like maven repositories, except the final folder is called main. Each module jar file must be placed there and paired with a corresponding module.xml file. For example, you'd want to create a folder in your install like this:

jboss-7.0.0.GA/modules/org/postgresql/main

Here is an example of module.xml:

<?xml version="1.0" encoding="UTF-8"?>
<module xmlns="urn:jboss:module:1.0" name="org.postgresql">
  <resources>
    <resource-root path="postgresql-9.0-801.jdbc4.jar"/>
  </resources>
  <dependencies>
    <module name="javax.api"/>
    <module name="javax.transaction.api"/>
  </dependencies>
</module>

You'll want to map the name of the jdbc driver jar, as well as the module name, here – we're going to map it to the configuration next. Once this is squared away, we'll want to configure the standalone.xml file in:

jboss-7.0.0.GA/standalone/configuration

Map and Configure

In standalone.xml, you'll want to look for the <subsystem xmlns="urn:jboss:domain:datasources:1.0"> node and add a shiny new configuration like this:

<subsystem xmlns="urn:jboss:domain:datasources:1.0">
	<datasources>
			<datasource jndi-name="java:jboss/DefaultDS" enabled="true" 
				jta="true" use-java-context="true" use-ccm="true"
				pool-name="postgresDS" >
			<connection-url>
				jdbc:postgresql://localhost:5432/database?charSet=UTF-8
			</connection-url>
			<driver>
				org.postgresql
			</driver>
			<transaction-isolation>
				TRANSACTION_READ_COMMITTED
			</transaction-isolation>
			<pool>
				<min-pool-size>
					10
				</min-pool-size>
				<max-pool-size>
					100
				</max-pool-size>
				<prefill>
					true
				</prefill>
				<use-strict-min>
					false
				</use-strict-min>
				<flush-strategy>
					FailingConnectionOnly
				</flush-strategy>
			</pool>
			<security>
				<user-name>
					username
				</user-name>
				<password>
					password
				</password>
				</security>
			<statement>
				<prepared-statement-cache-size>
					32
				</prepared-statement-cache-size>
			</statement>
		</datasource>
		<drivers>
			<driver name="org.postgresql" module="org.postgresql">
				<xa-datasource-class>
					org.postgresql.Driver
				</xa-datasource-class>
			</driver>
		</drivers>
	</datasources>
</subsystem>

Pay attention to:

	<driver>
		org.postgresql
	</driver>

Note: you can set this to the jdbc driver file name if you’re using the deploy approach. In fact, jboss will be more than happy to write the driver configuration for you if you deploy the driver from the deploy directory.

This entry maps to the driver configured directly below, matched by the driver name:

	<driver name="org.postgresql" module="org.postgresql">
		<xa-datasource-class>
			org.postgresql.Driver
		</xa-datasource-class>
	</driver>

The name property maps the driver to the datasource configuration, and the module property maps to the module we laid out in the first step. I'll point out that it seems you need to use a transaction-aware data source class here. I think you're supposed to be able to use a <datasource-class> node with the regular driver class, but when I tried this I got xml parsing errors – it doesn't seem to think "datasource-class" is a legal element.

You can look up the data source through the JNDI handle configured on the datasource node: jndi-name="java:jboss/DefaultDS". The rest of the properties and nodes configure various settings for your datasource, and if you've worked with them before you will probably be familiar with them already. If you need a refresher (like me) you can also look through the JBoss user guide documentation.
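
From application code, that JNDI name is what you'd use to get hold of the datasource. A quick sketch – the class and method names here are made up, but the JNDI string matches the jndi-name attribute above:

import javax.annotation.Resource;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;

public class DataSourceLookup {

	// container-managed injection, e.g. in a servlet or EJB
	@Resource(mappedName = "java:jboss/DefaultDS")
	private DataSource injectedDataSource;

	// or a manual JNDI lookup
	public Connection openConnection() throws NamingException, SQLException {
		DataSource ds = (DataSource) new InitialContext().lookup("java:jboss/DefaultDS");
		return ds.getConnection();
	}
}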

References:
JBoss Wiki on Datasource configuration
JBoss user guide documentation
JBoss Wiki Getting Started Guide
JBoss Getting Started Admin Guide

Configure ssh authorized keys for cvs access

Continous Integration

Lately I've been working on adding Hudson as the continuous integration (CI) server for projects at work. The whole notion of CI merits an entire discussion, but suffice it to say it's a very clean approach that helps automate the build process, particularly if you run manual builds driven by prompted shell scripts.

After looking at a few solutions, Hudson seemed from many accounts to be the easiest to get running, and pretty flexible when integrating into an existing build system. Add to the resume that it could run in a servlet container, divorcing environmental configuration from its automated build functionality, and we suddenly have a winner.

I went to work setting up integration build scripts and projects and all kinds of cool plugins, then finally hit a wall when it came time to wire up Hudson with cvs access. As it turns out, in our particular setup we access cvs via ssh, and ssh will usually require a password in order to connect to a remote host. When automating builds this can be quite problematic, since the whole point is to let builds fire off without interactive human intervention – and prompted passwords are very capable of raining on that parade.

I dug around for what seemed like forever until it became clear that the solution was to enable authorized key access via ssh, and to generate a public key that does not require a pass phrase. In a nutshell, you set up a public and private key pair, and configure it to require a pass phrase or not when requesting access. You then copy the public key to the correct location on the remote machines you want to enable access to. The last step is to configure authorized key access via ssh on the remote machine. Only then will you be able to ssh to the remote machine with the public key and without a password or pass phrase – in essence, that public key becomes trusted authentication.

Here are the steps, with more detail:

Configure your connect-from machine

Let’s assume you’re going to use an account called builder for this example. In your shell as builder, cd into ~/.ssh and run:

ssh-keygen -f identity -C 'builder identity cvs key' -N '' -t rsa -q

This will create the set of keys for you without a pass phrase. The -C flag sets the comment tagged at the end of the key. You want to end up with a file structure like this:

[builder@connectFrom.ssh]# ls -l iden*
-rw------- 1 jboss CodeDeploy 1675 Dec 5 09:54 identity
-rw-r--r-- 1 jboss CodeDeploy 405 Dec 5 09:54 identity.pub

on your connect-from machine. You will need to chmod the user’s home and .ssh directories to permission 0700. It turns out that these folder permissions are very picky and these keys will not work if the group or others have read/write access to that .ssh directory or its contents.

Configure your connect-to machine

You will now want to again create a ~/.ssh directory, also with permissions set to 0700, on the connect-to machine. Then use your favorite text editor to create the file ~/.ssh/authorized_keys. This one's even more strict – ensure that the permissions on ~/.ssh/authorized_keys are set to 0600. Paste the contents of the connect-from machine's ~/.ssh/identity.pub into this authorized_keys file. This step essentially copies the public key over as an authorized key to the remote machine. The authorized_keys file should have only one key per line, or it will cause problems. Lastly, we'll need to make sure that the PubkeyAuthentication flag is enabled on the connect-to machine and that it reads the correct authorized_keys file.
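
Putting those two steps together, the copy and the permission setup might look something like this (a sketch – the host names and paths follow the examples above):

# on the connect-from machine: ship the public key over
[builder@connectFrom.ssh]# scp identity.pub builder@connectTo:~/

# on the connect-to machine: create .ssh, append the key, lock down permissions
[builder@connectTo~]$ mkdir -p ~/.ssh && chmod 0700 ~/.ssh
[builder@connectTo~]$ cat ~/identity.pub >> ~/.ssh/authorized_keys
[builder@connectTo~]$ chmod 0600 ~/.ssh/authorized_keys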

Edit /etc/ssh/sshd_config and uncomment the following:


PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys

Test
Now you should be able to test the ssh connection, with debugging enabled, by running the following from the connect-from machine's shell:

[builder@connectFrom.ssh]# ssh -v builder@connectTo

You should see connection information useful for debugging – looking for something like this:

debug1: Next authentication method: publickey
debug1: Offering public key: /home/builder/.ssh/identity
debug1: Server accepts key: pkalg ssh-rsa blen 277
debug1: read PEM private key done: type RSA
debug1: Authentication succeeded (publickey).
debug1: channel 0: new [client-session]
debug1: Entering interactive session.
debug1: Sending environment.
debug1: Sending env LANG = en_US.UTF-8
Last login: Thu Dec 2 01:17:40 2010 from connectFrom
[builder@connectTo~]$

Configure Hudson to use the external ssh
Now that these authorized keys have been configured for use, you can go into Hudson and set up the cvs connection string. You will need to make sure that the cvs advanced configuration is set to:

$CVS_RSH: ssh

And you should be all set.

Your builder account should now be able to access the remote machine using the trusted authorized keys.

Resources:
How to allow SSH host keys on Linux (Fedora 10 & CentOS 5.2)
ssh – authorized_keys HOWTO
2.4.1 Connecting with rsh and ssh

Write a Stored Procedure in Postgres 8+

Stored Procs

Sometimes as developers we're tasked with data intensive work like importing data into a database, cleaning up sets of incomplete records or transferring data from one table to another through some kind of filter. While our application would normally be in charge of creating and maintaining the data, sometimes we don't want to end up writing an entire module or mini application to address these tasks. Since they're data intensive, a stored procedure might be a good approach to take. Stored procedures are programs written in a more robust version of sql (structured query language) that allows for the manipulation of data records directly within the database environment.

If we were to write the equivalent code using a layer written in java, .net, or php, there would be a lot of overhead cost in terms of processing power and performance – orders of magnitude more. As data is processed, results would normally be returned to that calling layer and shuffled around that layer’s memory, essentially adding another step to the process. If we make these changes as close to the data as possible, we’ll be able to squeeze as much performance as possible and suffer the least amount of overhead. Just for perspective here’s an example: a 1 gigabyte file could take several hours to import using java business logic, while a stored proc could take less than half an hour. Mileage may vary of course, but that’ll give you an idea of the performance cost you could save with data intensive tasks like that. A word of caution though: I’m not saying a stored proc is the way to go for your entire application; it’s merely a tool that can be used in your arsenal to get the job done with the most efficient means possible.

Example

Here's an example of a generic stored proc written in PL/pgSQL (the postgres procedural language).

CREATE OR REPLACE FUNCTION example_stored_proc() RETURNS void AS $$ 
DECLARE 
     userRecord record; 
     user_property_id bigint;
BEGIN 
     FOR userRecord IN  
          SELECT * FROM tb_user u ORDER BY u.user_id 
     LOOP 
          SELECT INTO user_property_id nextval('sq_user_property'); 

          -- user_property_id now has a value we can insert here
          INSERT INTO tb_user_property VALUES(
                    user_property_id ,
                    'user_id',
                    userRecord.id
          ) ; 
 
          IF userRecord.email like 'user@domain.com' THEN

                    update tb_user set email = 'user@other-domain.com' where id = userRecord.id;

          ELSEIF userRecord.email is null THEN

                    update tb_user set active = false where id = userRecord.id;

          ELSE

                    RAISE NOTICE 'did not update any record';

          END IF;

          RAISE NOTICE 'added property for user id: %', userRecord.id; 
       
     END LOOP; 
     RETURN; 
END; 
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION example_stored_proc() RETURNS void AS $$
CREATE OR REPLACE FUNCTION will create the stored proc in the database. RETURNS declares the data type returned at the end. This example returns void, but an integer, a record or a result set may also be returned. The text between the two pairs of $$ is the body of the procedure.

DECLARE
This keyword initializes the variables the stored proc will be using. It essentially lets the database know to allocate memory for use.

BEGIN
This marks the beginning of the stored proc logic. It naturally ends with END.

FOR userRecord IN
SELECT * FROM tb_user u ORDER BY u.user_id
LOOP

-- do stuff

END LOOP;

This is the basic looping structure used in PL/pgSQL. Notice the loop is built around a straightforward sql query – here is where the magic happens. The looping variable in this example is "userRecord" – it holds the currently fetched record and lets you manipulate it in the body of the loop. So, if you wanted to insert the value of userRecord.id into a table, you could just reference it as a variable, as shown in the insert statement in the loop's body.

SELECT INTO

Using this construct allows you to capture query results into a variable for later use. The variable can be a record or a single column value. In order for it to work you need to declare the variable that's going to take the value in the DECLARE section of the stored proc. Inline variable declaration is not supported.

Conditionals

As expected, the IF/THEN/ELSEIF/ELSE/END IF construct can be used to create conditional sequences of logic. The conditions can be any expression postgres can evaluate. ELSEIF can be used to wrap secondary conditions, while ELSE of course is the default if no other conditions are met. Fairly self explanatory.

RAISE NOTICE

This is your standard psql logging output statement. The text in the single quotes is output to the console/message window, and every “%” is substituted with the ordered value after each comma in the statement. So, in this case “userRecord.id” is substituted into the first % to appear in the output text. If you wanted to have multiple values output you could construct your RAISE NOTICE like this:

RAISE NOTICE 'this is record % out of 1000, and its value is %', record_number, record_value; 

It would substitute record_number into the first % and record_value into the second % appearing in the text.
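
Once the function has been created, running it is just a select – the RAISE NOTICE output shows up in your console or message window:

SELECT example_stored_proc();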

XML, Xalan, Endorsed dirs and &..

So recently we've been working on a project that makes use of OpenSAML. As it turns out, OpenSAML requires newer Xalan libraries (2.7.1 to be precise) – the kind that don't ship with the older incarnation of jboss we are using for the project, version 4.02. Some of you might be familiar with the jboss system properties and will know there's a property jboss uses specifically to override the standard xml libraries that ship with the jdk/jre. Jboss will allow you to pass in, as a parameter, the location for a variable known as "java.endorsed.dirs." The purpose of this property is to point at the Xalan libraries you would like jboss to use as the Xalan implementation at runtime.

-Djava.endorsed.dirs=/path/to/your/xalan/libraries

So if you have other installed applications running in different instances, you won't have to upgrade every instance you're running concurrently – instead you can override a specific instance's Xalan libraries by using this parameter in the run script. I'm not quite sure what version of Xalan ships with jboss 4.02, but when we upgraded, the first thing we noticed was that xml text like "&amp;" rendered as "&amp;" in the post-xslt output instead of rendering as "&" (presumably a fix set forth as Xalan matured):

<xsl:param name="url">
	http://www.some-url.com/path.do?parameter=value&amp;otherParameter=otherValue
</xsl:param>

turned into

http://www.some-url.com/path.do?parameter=value&amp;otherParameter=otherValue

If you intend to upgrade your Xalan libraries, I would think you might need to do some regression testing to make sure upgrading these xml-centric libraries doesn't inadvertently break xml-dependent sections of your application. It should be noted that if you randomly toss upgraded xalan jars into your application, you're bound to run into all kinds of crazy exceptions. I've seen jboss complain about login-conf.xml, missing class libraries, weird servlet allocation exceptions, class not founds – all kinds of misleading problems that seem unrelated to xalan jar collisions or weirded-out dependencies.

Bottom line is if you need to upgrade Xalan, stick to using java.endorsed.dirs, and pass in the -Djava.endorsed.dirs param into the jboss run script if you want to override a specific instance.
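
If you want the override to stick, the same flag can be appended to the JAVA_OPTS your run script picks up – in many jboss 4.x installs that means editing bin/run.conf, or run.sh directly, depending on your setup. A sketch:

# pass the endorsed dirs override to the JVM on startup
JAVA_OPTS="$JAVA_OPTS -Djava.endorsed.dirs=/path/to/your/xalan/libraries"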

Java, XML and XStream

What's an object/xml serializing/deserializing library?

If you’ve never worked with an object/xml serializer and are considering writing your own from scratch, you may want to consider using a library like XStream. XStream is very good at moving java into xml and back. It allows a high level of control over how the xml can be organized and structured and even allows the user to create their own converters for even more flexibility.

But still, why use something like this when you can be perfectly happy writing your own data conversion scheme? The problem really boils down to flexibility and to reinventing the wheel. Ninety percent of the time you're already interrogating a datasource (some rdbms like oracle, postgres or mysql) and will be using some kind of TransferObject or maybe an Entity persistence scheme built around pojos. If you write your own serializing engine from scratch by mapping pojos to dom4j nodes, constructing Document objects and then using them for stuff like xsl transformations, you end up missing out on a great tool.

It may not seem obvious right now, but a homegrown serializer is the kind of thing you write once and forget about, and then months or years down the line, when it comes time to update your data model or expand its framework, you end up rebuilding all the dom4j stuff from scratch. Unless you take the lazy route and append any new xml to the root node to save yourself the full node refactor. Maybe simple objects with one or two simple nested objects won't seem like much, but if your object becomes anything close to a complex lattice, then going back and tweaking the entire structure when you want to expand or refactor your xml can become quite perilous. Especially if you want to make your xml as xpath friendly as possible.

Edit:
As Felipe Gaucho has been kind enough to point out, XStream only writes a text string as the serialized object. It will not perform any validation on your XML, so you're left on your own to validate it post-serialization. Something like JAXP comes to mind to tackle XSD based validation, or JiBX if you're looking for data binding.

So what does XStream do for me?

Consider these objects:

 public class MyClass {

	protected MyObject object;
	
}
public class MyObject {

	protected ArrayList Field;
	
}

XStream lets you do something like this if you want to serialize an object like MyClass to xml:

 XStream xstream = new XStream();
String myClassXML= xstream.toXML(myClassObject);

and if you want to go from xml back to a java object you can do this:

 XStream xstream = new XStream();
MyClass myClassObject= xstream.fromXML(myClassXML);

As you can see, all the plumbing goes away and you are now free to concentrate on writing the rest of your application. And if you want to change your object model, consolidate nodes or rearrange the structure of your xml, all you have to do is update your pojo and your xml will immediately reflect the updated data model on serialization.

It should be noted that to completely deserialize xml, your object needs to correctly map all the data in the xml. If you have trouble deserializing try building a mock object and populating it with sample values and then serialize it to xml; then you can compare the test xml to what your actual xml is and make your changes.

Aliasing

XStream does not require any configuration, although the xml produced out of the box will likely not be the easiest to read. It will serialize objects into xml nodes according to their package names, usually making them very long as we can see from the following example:

<com.package.something.MyClass>
	<com.package.something.MyObject>
		<List>
			<com.package.something.Field/>
			<com.package.something.Field/>
		</List>
	</com.package.something.MyObject>
</com.package.something.MyClass>

Luckily XStream has a mechanism we can use to alias these long package names. It goes something like this:

XStream xstream = new XStream();
xstream.alias("MyClass", MyClass.class);
xstream.alias("MyObject", MyObject.class);
xstream.alias("Field", Field.class);

Adding an alias like this will let your xml come across nice and neat like this:

 <MyClass>
	<MyObject>
		<List>
			<Field/>
			<Field/>
		</List>
	</MyObject>
</MyClass>

Attributes

If you want to make a regular text node an attribute, you can use this call to configure it:

 xstream.useAttributeFor(Field.class, "name");

This will make your xml change from this:

 <MyClass>
	<MyObject>
		<List>
			<Field>
				<name>foo</name>
			</Field>
			<Field/>
		</List>
	</MyObject>
</MyClass>

into

 <MyClass>
	<MyObject>
		<List>
			<Field name="foo"/>
			<Field/>
		</List>
	</MyObject>
</MyClass>
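
For completeness, this assumes Field is a simple pojo with a name property – something like this hypothetical sketch:

// hypothetical Field pojo backing the useAttributeFor example above
public class Field {

	private String name;

	public String getName() { return name; }
	public void setName(String name) { this.name = name; }
}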

ArrayList (implicit collections)

ArrayLists are a little trickier. This is what they look like out of the box:

 ...
	<MyObject>
		<List>
			<Field/>
			<Field/>
		</List>
	</MyObject>
...

Note there's an extra "List" node enclosing the list elements named "Field". If we want to get rid of that node so that Field sits right under MyObject, we can tell XStream to map an implicit collection by doing the following:

 xstream.addImplicitCollection(MyObject.class, "Field", "Field", Field.class);

where the addImplicitCollection method signature is the following:

 /**
	 * Appends an implicit collection to an object for serialization
	 * 
	 * @param ownerType - class owning the implicit collection (class owner)
	 * @param fieldName - name of the field in the ownerType (Java field name)
	 * @param itemFieldName - name of the implicit collection (XML node name)
	 * @param itemType - item type to be aliased by the itemFieldName (class owned)
	 */
	public void addImplicitCollection(Class ownerType,
            String fieldName,
            String itemFieldName,
            Class itemType) 

Adding this implicit collection configuration will streamline the xml so that it looks like this now:

 
<MyClass>
	<MyObject>
		<Field/>
		<Field/>
	</MyObject>
</MyClass>

Notice the “List” node is gone, and “Field” is now directly under “MyObject”. You can find the complete documentation on the XStream website here.

There are plenty of more tricks you can use to configure/format your xml, and there are plenty of examples listed on the XStream website, but these three points here should cover the basics to get you started.

5 ways to make XML more XPath friendly

As java developers we should always do what we can to optimize outbound xml from our side of the fence. I mean, it's our job to build and design awesome, elegant and efficient software whenever possible, right? We have our data and we want to serialize it into xml – how can we make our xml as efficient and xpath friendly as possible?

1) Keep the structure simple

Consolidate data nodes whenever possible before marshaling your object into xml. You really don’t want to have to resort to using xpath for any unnecessary lookups across nodes. Keep those lookups confined to the persistence layer where they belong. I realize this may not always be possible, but keeping it to a minimum will really help the xsl developers by not forcing them to create ridiculous xpath expressions in order to link discretely separate data nodes. We want to keep the xpath logic as simple as possible.

Consider the following xml document:

<root>
	<house id="1" color="white"/>
	<room name="livingRoom" houseId="1" windows="3"/>
	<room name="kitchen" houseId="1" windows="2"/>
	<room name="bedroom" houseId="1" windows="4"/>
	<room name="bathroom" houseId="1" windows="1"/>
</root>

If this is how your xml is structured, and you wanted to transform this data with xsl you are creating all kinds of extra work and could end up causing an unnecessary performance bottleneck for your application. If you wanted to do something like lay out the total number of rooms for the house with id of 1, your xpath would have to look something like this:

count(/root/room[@houseId=1])

You are now making your xsl developer implement logic to count all the rooms for a particular house, using an xpath function and conditional selector logic to filter a global count. Just because you can use the count function does not mean you should use it every chance you get. This xpath expression will traverse all of your xml, count the number of room nodes whose houseId is 1, and return the total. It's not much of a problem if your xml has one or two house nodes, but what if you had like 10, 20 or even 30 house nodes or more? If you are processing numbers like these, and then you span these node traversals across, say, a hundred requests, you would be doing something like 3,000 traversals. What if instead you used an xml structure like this:

<root>
	<house id="1"  totalRooms="4" color="white">
		<room name="livingRoom" windows="3"/>
		<room name="kitchen" windows="2"/>
		<room name="bedroom" windows="4"/>
		<room name="bathroom" windows="1"/>
	</house>
</root>

In this example we attached the room count to the house node as an attribute. This way, our xpath expression ends up looking like this:

/root/house/@totalRooms

No count function, no selector conditional logic to filter; you end up with a simple, basic xpath expression. You’re doing a single lookup instead of an entire 3,000 node traversal while collecting a node count that has to be calculated as the transformation is processing. Let the data/persistence layer perform those types of counts and populate all the data, and let your xsl/xpath lay out your data. Keep the structure simple. If this is not possible, you might be doing something wrong.
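
With the count precomputed by the persistence layer, the xsl that lays it out stays trivial – a small sketch:

<!-- sketch: laying out the precomputed room count -->
<xsl:value-of select="/root/house/@totalRooms"/>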

2) Avoid namespaces whenever possible

Namespaces are important, yes. But if you are trying to express a java object as xml, prefer using a different name altogether rather than attaching namespaces. This really comes down to creating a more specific, descriptive naming scheme for your application's objects. If you are using namespaces just for the heck of it, I'd urge you not to. You end up adding lots of unnecessary noise to your xpath, and having those colons all over the place in your xml can make the document look like it's been processed through a cheese grater. Anyone trying to read your xml will cringe at the thought and will want to pass it off to the new guy as part of the hazing ritual. Consider the following xml:

<root>
	<myHouseNamespace:house id="1"  totalRooms="4" color="white">
		<myHouseNamespace:room name="livingRoom" windows="3">
			<myHouseNamespace:furniture name="sofa" type="leather"/>
			<myHouseNamespace:furniture name="table"/>
			<myHouseNamespace:furniture name="lamp"/>
		</myHouseNamespace:room>
		<myHouseNamespace:room name="kitchen" windows="2"/>
		<myHouseNamespace:room name="bedroom" windows="4"/>
		<myHouseNamespace:room name="bathroom" windows="1"/>
	</myHouseNamespace:house>
</root>

This makes your xpath look like this:

/root/myHouseNamespace:house/myHouseNamespace:room[@name='livingRoom']/@windows

I don’t know about you, but this xpath expression is hard to read with all the “myHouseNamespace:” junk all over the place. And we only went 2 nodes deep into the tree. A third level down would have marched the xpath expression across the width of this blog! Who loves mile long xpath expressions that add side scrollbars to your text editor on your widescreen monitor? NO ONE.

3) Use attributes wherever appropriate

There is really no difference between using an attribute and a single text node to express a simple value in xml. In other words, there is nothing different between this:

<root>
	<house>
		<id>1</id>
		<totalRooms>4</totalRooms>
		<color>white</color>
	</house>
</root>

and this:

<root>
	<house id="1"  totalRooms="4" color="white"/>
</root>

So why prefer attributes? Because it makes your xml document legible. Having a ton of single child text nodes adds a lot of noise to your document, especially if they are numeric or single-word values. It becomes easy to mix up your text values with real, slightly more complex nodes of data in your xml tree. Readability is important, and keeping your xml as efficiently expressed as possible without sacrificing readability is paramount to making it understandable and easier to work with.

4) Clearly name your xml nodes

Use names that actually describe the data being represented. Stay away from these kinds of names:

<root>
	<h id="1"  totalRooms="4" color="white">
		<r name="livingRoom" windows="3">
			<f name="sofa" type="leather"/>
			<f name="table"/>
			<f name="lamp"/>
		</r>
		<r name="kitchen" windows="2"/>
		<r name="bedroom" windows="4"/>
		<r name="bathroom" windows="1"/>
	</h>
</root>

Sure, your xpath will be much shorter :

/root/h/r[@name='livingRoom']/f[@name='sofa']/@type

But unless you're very, very comfortable with the single letter naming convention, you might end up having a hard time keeping track of all the nodes, since they're small and easy to overlook. Descriptive, concise names make your xml easier to learn or come back to, because the data is named in a clear, self explanatory way. Ideally, your xml should have names that even an untrained eye can pick up, making sense of the basic structure with little preparation.

5) Make your xml as human readable as possible

This point encapsulates all the others. XML is a language meant to be tied very closely to data, and our ability to understand that data will allow us as developers to mold it into whatever we want it to be. If you’ve ever had to sit through a very complex piece of xml, you’ll realize that forcing anyone to have to muck through an unconventional structure and non-intuitive names ends up breaking up the xslt development into three phases:

1) figuring out/understanding the xml
2) implementing the xsl
3) figuring out/understanding the xml, and what was just implemented

The longer the developer has to sit there and meddle with #1, and #3, the more time is lost in bringing your product to the next release. We want to spend as little time as possible figuring out or compensating for poorly structured xml so the real implementation work can be completed, and we can move on to the next thing.

In other words

The bottom line is if you structure your xml so that it's easy for humans to understand, and it doesn't cut corners by passing data lookups or counts off to the xsl developer, your application will become much more efficient, well written, and easier to work with. This is a good place to exercise the separation of concerns principle: let the data/persistence layer do what it does best, and let the xml/xslt layer do what it does best. XSLT is a relatively expensive process, but even with infinite computing resources we should always strive to make the most out of whatever resources we can allocate.

Basic Ant scripts

What’s an Ant script? Do I need bug spray?

Ant is a scripting tool commonly used to build, compile and deploy projects. This is in no way an all encompassing inventory of what Ant can do. It is extensible and its instructions are expressed in an xml format whose nodes comprise a framework of abilities designed to make menial tasks automated.

From a developer's perspective, the most basic Ant tasks are the compile, package and copy tasks. All java projects must do these three things many, many, many times during a development cycle. It's very boring and tedious if you do it by hand through the command prompt. First you'd have to type in a bunch of commands like "javac [ options ] [ sourcefiles ] [ @argfiles ]", detailing all the class paths you want to use, all the source files, and then all the other supporting parameters you need to enter to get it to compile your program correctly. If you're only writing one class, it's probably not that bad. But when you have hundreds of classes, multiple projects and dependencies, and a slew of directories to configure and lay out for compiling, it quickly becomes ridiculous. In fact, I would claim that it becomes ridonkulous.

Ant lets you define tasks that break up these chores into a series of chained events. An Ant build is broken up into what are called "targets". Each target is meant to perform one task, or unit of work. If we break up the compile/deploy process it could look something like this:

  1. clean out the scrub/temporary directory
  2. compile all the java files into class files
  3. package up all the class files into a jar file, or some kind of deployable artifact
  4. copy the new jar file to the deploy directory

We can define each one of these steps with an Ant target. This is a good thing because it allows us to chain them like stairs, one task leading into the next. If any one of the tasks fails, the build fails and Ant tells us where the problem happened (with line number and the exact problem or exception).

Here are what these steps might look like:

1) Clean up the build directories

<!-- *************** -->
<!-- * Preparation * -->
<!-- *************** -->

<target name="prepare" depends="clean">
	<mkdir dir="${build.dir}"/>
	<mkdir dir="${build.dir}/jars"/>
	<mkdir dir="${build.dir}/openscope"/>
</target>

Here, the depends="clean" attribute references another target that deletes all these scrub directories – a sketch of one appears below. This "prepare" target then creates the scrub directories we're going to use in our build. mkdir creates a directory.
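
A hypothetical clean target to go with it can be as simple as this:

<!-- hypothetical clean target deleting the scrub directories -->
<target name="clean">
	<delete dir="${build.dir}"/>
</target>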

2) Compile all the java files into class files

<!-- *************** -->
<!-- * Compilation * -->
<!-- *************** -->	

<target name="compile" depends="prepare">
	<javac destdir="${build.dir}/openscope"
			debug="on"
			deprecation="on"
			optimize="off">
		<src path="${src.dir}"/>
	<classpath refid="build.classpath"/>
	</javac>

	<copy todir="${build.dir}/openscope">
		<fileset dir="${props.dir}">
			<include name="*.properties"/>
		</fileset>
	</copy>
</target>

Ant compiles things with the "javac" task. It takes a few parameters and optional flags we can use to customize the actual compile command. This target also copies any properties files into the scrub directory.

3) Package up all the class files into a jar file, or some kind of deployable artifact

<!-- *************** -->
<!-- *   Building  * -->
<!-- *************** -->

<!-- Package the logic module -->
<target name="package-logic" depends="compile">
	<jar jarfile="${build.dir}/jars/${logic.file}">
		<fileset dir="${build.dir}/openscope">
			<include name="com/openscope/**"/>
			<include name="*.properties"/>
		</fileset>

		<metainf dir="${resources.dir}">
			<include name="persistence.xml"/>
		</metainf>
	</jar>

	<copy todir="${basedir}/deploy/${ear.file}">
		<fileset dir="${build.dir}/jars">
			<include name="${logic.file}"/>
		</fileset>
	</copy>
</target>

<target name="build-war" depends="package-logic">

	<jar jarfile="${build.dir}/jars/${war.file}"
		basedir="${basedir}/deploy/${ear.file}/${war.file}"/>

</target>

The "jar" task jars up the contents of a directory. We can add files to the META-INF directory with a file include directive under the nested "metainf" element of the "jar" task.

4) Copy the new jar file to the deploy directory

<!-- **************** -->
<!-- * Make the Ear * -->
<!-- **************** -->

<!-- Creates the application ear file. -->
<target name="assemble-app" depends="package-logic,build-war">

	<ear destfile="${build.dir}/${ear.file}"
		basedir="${basedir}/deploy/${ear.file}"
		appxml="application.xml"
	>

	<manifest>
		<attribute name="Built-By"
			value="Openscope Networks"/>
		<attribute name="Implementation-Vendor"
			value="Openscope Networks"/>
		<attribute name="Implementation-Title"
			value="Webminders"/>
		<attribute name="Implementation-Version"
			value="0.1"/>
	</manifest>

	</ear>

</target>

The "ear" task, as you can imagine, packages up an ear file for deployment. It works very similarly to the jar task and offers a few more options that relate directly to the ear file. You can find more tasks on the Ant documentation page.

If you put these basic steps together and add some properties, you will end up with a simple ant script that can build most of your java projects. Customization, of course, is where the power of any scripting tool ends up earning its keep. Javadocs can be generated as part of the builds, FindBugs can do code analysis, deployable artifacts can be ftp/scp'd across the network – heck, you can even write your own Ant task to do whatever automated unit of work you want to define.
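
The property names referenced throughout (${build.dir}, ${src.dir}, ${logic.file} and so on) would be defined near the top of the build file. A hypothetical set of definitions, just to show the shape:

<!-- hypothetical property definitions backing the targets above -->
<property name="src.dir"       value="${basedir}/src"/>
<property name="props.dir"     value="${basedir}/properties"/>
<property name="resources.dir" value="${basedir}/resources"/>
<property name="build.dir"     value="${basedir}/build"/>
<property name="logic.file"    value="openscope-logic.jar"/>
<property name="war.file"      value="openscope.war"/>
<property name="ear.file"      value="openscope.ear"/>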

Resources:
Here’s the complete ant script that I use for one of my simple projects.
The Ant task documentation page