« Posts under sql

Subverting foreign key constraints in postgres… or mysql

Temporarily disable key constraints?

On postgres (version 8.1, mind you) I ran across a scenario where I was had to update a set of records that carried foreign key constraints with other tables. I was tasked with updating this table, and the new data may could end up in a state with broken key constraints. The normal postgres replace function would not work as there was no natural regex replace that I could run that would affect all the entries the way I wanted without breaking FK constraints. Ultimately I had to break down my queries in such a way that that at the end of the transaction, the constraints would check out. It turns out that in postgres when you define a foreign key, you can flag it as DEFERRED:

ALTER TABLE tb_other ADD CONSTRAINT tb_other_to_table_fkey 
	FOREIGN KEY (tb_table_pk) REFERENCES tb_table (tb_table_pk) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY IMMEDIATE;

With the alter table command above we can then make use of this DEFERRABLE clause -this flag tells postgres that this constraint check may be deferred until the end of the transaction. The INITIALLY IMMEDIATE clause tells postgres the default constraint behavior is to check the constraint immediately, when the transaction attempts to perform the corresponding delete or insert. You can also flag the constraint to be INITIALLY DEFERRED. Initially deferring as you might guess tells postgres to check the constraint at the end of the transaction. I think generally if you want constraints though, you will probably want to check immediately. It’s good to know you have the option if you really need it though.

Once the foreign key constraint is set as deferrable, we can then execute a script like this to defer the constraint checks until the end of the transaction:

-- postgres deferred constraints in action
begin;

SET CONSTRAINTS ALL DEFERRED;

delete from tb_table;

insert into tb_table values ( nextval(sq_table), value1, value2, value3);
insert into tb_table values ( nextval(sq_table), value1, value2, value3);
insert into tb_table values ( nextval(sq_table), value1, value2, value3);

commit;

Pretty useful in my opinion. I think I prefer this solution as opposed to disabling triggers across the table since disabling triggers is a schema change and you end up being responsible for restoring them once you’re done. Consider the following :

-- postgres disabled triggers
begin;

ALTER TABLE tb_site DISABLE TRIGGER ALL;

delete from tb_table;

insert into tb_table values ( nextval(sq_table), value1, value2, value3);
insert into tb_table values ( nextval(sq_table), value1, value2, value3);
insert into tb_table values ( nextval(sq_table), value1, value2, value3);

-- make sure to restore the triggers
ALTER TABLE tb_site ENABLE TRIGGER ALL;

commit;

In this implementation you end up altering the schema to disable all the triggers associated with this table. Don’t forget to re-enable the triggers at the end of the transaction of the disabling will remain in place. Another thing to consider is if you have auditing type of triggers on your target table, you will then end up having to manually fire those triggers or run the appropriate clauses to perserve the original trigger’s integrity. This kind of thing could quickly turn into quite the problem if not handled correctly.

Mysql’d keys

The mysql approach handles this case very similar to the disabled triggers – instead, it uses a system variable called FOREIGN_KEY_CHECKS that can be toggled on or off:

 

-- mysql key constraint supression
begin;

-- lift 
SET FOREIGN_KEY_CHECKS=0;

delete from tb_table;

insert into tb_table values ( nextval(sq_table), value1, value2, value3);
insert into tb_table values ( nextval(sq_table), value1, value2, value3);
insert into tb_table values ( nextval(sq_table), value1, value2, value3);

-- put back when you're done
SET FOREIGN_KEY_CHECKS=1;

commit;

As you can see it’s a very similar approach to the trigger disable in postgres. From the documentation at the time of this writing (mysql version 5.5 – Deferred Foreign Keys in MySql) it looks like deferred keys are just not an option in mysql even though it’s listed as a standard. Worthy of notice.

References:
Postgres Set Constraints
Postgres Create Table documentation

Run a huge query as fast and safely as possible

Use this as a last resort

Queries that take a long time are generally a bad thing. If your application requires these kinds of measures to perform its duties, then chances are you really need to revise your table structures and/or you queries – ideally these queries should take seconds at the most, while data warehouse type reporting queries should be on the order of minutes. That said, sometimes you may need to update your entire schema, delete columns on a table with millions of records, or run a stored proc that goes in and cleans up data across several sets of tables across untold numbers of rows. If you try to run it from putty, or any other remote terminal and anything happens that might sever your connection, you might end up SOL with a rolled back exception that would leave you exactly where you started – with no data updated. These are some strategies you can use to mitigate the risk and cut down on the query time.

Try different strategies

Consider running a proc that pulls one million records, and then updates each record individually – you might want to get some popcorn since that update might take a while. That kind of update is a linear approach and generally bad because it will need to sequentially go through each record one at a time. Divide and conquer might work better – you could try batch updates across segments of the table where indexes are used – something like:

update table set column = value where constraint = 'arbitrary value';
update table set column = otherValue where constraint = 'some other value';

Another approach could be to reconstruct the table using the data from your target table, while filtering out or substituting in hardcoded values for the data you want to replace:

insert into clone_table
select primary_key, column, now() as activated_date, 
	other_column, true as is_active
from table 
where status = 'active'

You could use this approach to reconstruct your table with the data you want and then swap table references on the foreign keys. That part might get a little tricky but if you do it right, using select insert could end up saving you quite a bit of time – select inserts could take minutes while updates could take orders of magnitude longer.

Use screen to wrap your remote session

If your database is running on unix, without a doubt you’ll want to use screen if you need to run a very long query. If your database is on linux, I’m not sure there’s an equivalent. Anyone that’s used putty or some other terminal type of remote console app knows what it’s like to have some long running process terminate prematurely because the connection was severed, or your computer crashed. Screen saves you from those infrequent occurrences by creating an emulated session that can be detached/re-attached such that if you do get disconnected, you can go back and pick up where you left off. It’s very handy for executing a long running process where a disconnect would either cancel the proc and normally terminate the session.

To invoke screen, just type the word screen into the command prompt:

[root@bedrock ~]# screen

This will start your screen session. This may or may not provide some relevant information at the bottom of the screen like in the example below depending on your flavor of unix or configuration:

[root@bedrock ~]#

[ bedrock ][ (0*bash) ][2011-09-09 21:57 ]

Now that screen is up, you can disconnect your terminal app without fear that your screen session would terminate prematurely. You can then log back into the unix box and get a listing of all the current screen sessions with the following command:

[root@bedrock ~]# screen -ls
There are screens on:
     27470.pts-0.bedrock (Attached)
     8177.pts-0.bedrock (Detached)
     mySessionName (Detached)
3 Sockets in /var/run/screen/S-agonzalez.

I should point out that the session name is organized like [processId.sessionName]. You can name your session upon creation with the following command:

[root@bedrock ~]# screen -S yourSessionName

Once you’ve found the right screen session (they’re listed by session name) you can re-attach your severed session with the following command:

[root@bedrock ~]# screen -r mySessionName
There are screens on:
27470.pts-0.bedrock (Attached)
8177.pts-0.bedrock (Detached)
2 Sockets in /var/run/screen/S-agonzalez.

Once you’re in screen it’s useful to know a few keyboard commands to get around:

Control+c, pause then Control+d Detaches your session without termination
Control+c, then h screen capture, and save to your home directory as hardcopy.x (x being the number)
Control+c, then C (capital c) clear the screen of text
Control+c, then N (capital n) display information about the current screen window
Control+c, then ? help screen!

You can find more command, options and details at this screen manpage.

Run your query through a local pipe

If your query pulls back a lot of data, its going to require bandwidth to pipe it all back to your remote client. Don’t use remote clients (lik pgAdmin, MySQL workbench, SQuirreL etc) unless you’re running them directly on the box that’s running your database. Connect remotely and log in through a local pipe however you’re supposed to connect to the local command line:

[root@bedrock ~]# psql -l username my_pg_database
Welcome to psql 8.1.21, the PostgreSQL interactive terminal.

Type: copyright for distribution terms
h for help with SQL commands
? for help with psql commands
g or terminate with semicolon to execute query
q to quit

ml_publisher=#

You would be amazed how much faster a query runs when you’re running it directly on the machine. To give you an idea – running an update across 2 million rows might take an hour if you’re running from a remote client, while running it directly on the box might take mere minutes. We’re talking orders of magnitude of performance – for most development remote is perfectly fine, but for heavy lifting you can’t beat a local pipe.

Now you can run your query… If you’re running on a local pipe and you’re running on screen, you should be able to sever your screen connection without terminating your super long query. Let’s hope that query doesn’t table lock everything to kingdom come!

Write a Stored Procedure in Postgres 8+

Stored Procs

Sometimes as a developer we’re tasked with data intensive work like importing data into a database, cleaning up sets of incomplete records or transferring data from one table to another through some kind of filter. While our application would normally be in charge of creating and maintaining the data, sometimes we don’t want to end up writing an entire module or mini application to address these tasks. Since they’re data intensive, a stored procedure might be a good approach to take. Stored procedures are a type of program written using a more robust version of sql (structured query language) that allows for the manipulation of data records directly within a database environment.

If we were to write the equivalent code using a layer written in java, .net, or php, there would be a lot of overhead cost in terms of processing power and performance – orders of magnitude more. As data is processed, results would normally be returned to that calling layer and shuffled around that layer’s memory, essentially adding another step to the process. If we make these changes as close to the data as possible, we’ll be able to squeeze as much performance as possible and suffer the least amount of overhead. Just for perspective here’s an example: a 1 gigabyte file could take several hours to import using java business logic, while a stored proc could take less than half an hour. Mileage may vary of course, but that’ll give you an idea of the performance cost you could save with data intensive tasks like that. A word of caution though: I’m not saying a stored proc is the way to go for your entire application; it’s merely a tool that can be used in your arsenal to get the job done with the most efficient means possible.

Example

Here’s an example of a generic stored proc written in psql (postgres version).

CREATE OR REPLACE FUNCTION example_stored_proc() RETURNS void AS $$ 
DECLARE 
     userRecord record; 
     user_property_id bigint;
BEGIN 
     FOR userRecord IN  
          SELECT * FROM tb_user u ORDER BY u.user_id 
     LOOP 
          SELECT INTO user_property_id nextval('sq_user_property'); 

          -- user_property_id now has a value we can insert here
          INSERT INTO tb_user_property VALUES(
                    user_property_id ,
                    'user_id',
                    userRecord.id
          ) ; 
 
          IF userRecord.email like 'user@domain.com' THEN

                    update userRecord set email = 'user@other-domain.com' where id = userRecord.id;

          ELSEIF userRecord.email is null THEN

                    update userRecord set active = false where id = userRecord.id;

          ELSE

                    RAISE NOTICE 'didn't update any record';

          END IF;

          RAISE NOTICE 'added property for user id: %', userRecord.id; 
       
     END LOOP; 
     RETURN; 
END; 
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION example_stored_proc() RETURNS integer AS $$
CREATE OR REPLACE FUNCTION will create the stored proc in the database. RETURNS declares the data type returned at the end. This example returns an integer, but a record or a result set may also be returned. The text in between the two pairs of $$ is the body of the procedure.

DECLARE
This keyword initializes the variables the stored proc will be using. It essentially lets the database know to allocate memory for use.

BEGIN
This marks the beginning of the stored proc logic. It naturally ends with END.

FOR userRecord IN
SELECT * FROM tb_user u ORDER BY u.user_id
LOOP

– – do stuff

END LOOP;

This is the basic looping structure used in psql. Notice the loop is built around a straight forward sql query – here is where the magic happens. The looping variable in this example is “userRecord” – it holds the currently fetched data record and allows you to manipulate it for your own means in the body of the loop. So, if you wanted to insert the value of userRecord.id into a table, you could just stick in the insert statement as a variable as shown in the insert statement in particular loop’s body.

SELECT INTO

Using this construct allows you to create a temporary table to hold query results for later use. Your variable can be a record or a single column value. In order for it to work you need to declare the variable that’s going to take the value in the DECLRARE section of the stored proc. Inline variable declaration is not supported.

Conditionals

As expected, the IF/THEN/ELSEIF/ELSE/END IF construct can be used to create conditional sequences of logic. The conditionals need to be any kind of expression postgres can evaluate. The ELSEIF can be used to wrap secondary conditionals, while the ELSE of course is the default if no other conditions are met. Fairly self explanatory.

RAISE NOTICE

This is your standard psql logging output statement. The text in the single quotes is output to the console/message window, and every “%” is substituted with the ordered value after each comma in the statement. So, in this case “userRecord.id” is substituted into the first % to appear in the output text. If you wanted to have multiple values output you could construct your RAISE NOTICE like this:

RAISE NOTICE 'this is record % out of 1000, and its value is %', record_number, record_value; 

It would substitute record_number into the first % and record_value into the second % appearing in the text.

Set up postgres

Most linux oses come with postgres and mysql out of the box. If yours doesn’t or you want to run a newer version than the the one your os comes with you should be able to install it using the yum installer. Fedora 11 comes with a visual installer you can use to pick and choose what rpms you want running on your machine. Once you have postgres installed on your box you’ll want to initialize some dbs stuff:

[root@bedrock ~]# service postgresql initdb

open up postgres to listen to domains other than localhost:

[root@bedrock ~]# vi /var/lib/pgsql/data/pg_hba.conf

# connect from anywhere but use a cleartext password
host    all         all         0.0.0.0/0             password

There are a lot of flexible options you can use to configure permissions for logging into the database, ip restrictions, usernames, domains, authentication policies, its quite extensive. The pg_hba.conf file has a lot of examples you can gloss over and apply as you like. Once you’ve figured this out you’ll want to fire up the service:

[root@bedrock ~]# service postgresql start

also make sure the postgres service is flagged to fire up on boot:

[root@bedrock ~]# ntsysv

Then you’ll want to log into the database and set up some permissions:

[root@bedrock ~]# psql -d template1 -U postgres

Welcome to psql 8.3.8, the PostgreSQL interactive terminal.

Type:  copyright for distribution terms
       h for help with SQL commands
       ? for help with psql commands
       g or terminate with semicolon to execute query
       q to quit

template1=# create database new_database_dev;
CREATE DATABASE
template1=# grant all privileges on database new_database_dev to admin;
GRANT

template1=# create user developer with password 'somepassword';
CREATE ROLE

template1=# alter user developer with password 'password';
ALTER ROLE

and that’s it, you should be ready to work with your new postgres database.