Title Slide
Presented July 2003 at O'Reilly Open Source Conference, Portland, OR
Caveat: These are my notes on this talk. They are not necessarily complete nor are they a definitive record of any live talk, real or imagined.
Modifications to content of the original script based on new information are noted in italics.
I will demonstrate how to use plpython functions in PostgreSQL in order to perform inter-row calculations
You don't need to know plpython for this talk.
This technique can be applied to many different problems. Using this technique you can create simple or elaborate running counters, sums, differences between rows and whatever other useful functions you want to execute on a running set of data.
I would like to acknowledge my comrade David Gould who came up with the original idea of doing inter-row calculations with plpython.
For a discussion of recent changes in plpython, see General Bits Issue #32
I've worked exclusively on and with ORDBMS for more than 10 years and in the database industry for more than 15 years.
I'm the editor/author PostgreSQL General Bits, a web column of informational items mostly from the mailing list pgsql-general.
I'm currently working on a book about Extensibility in PostgreSQL. (No ETA...)
And I have recently released a beta version of a simple reporting engine for PostgreSQL written in perl.
I also do database consulting and PostgreSQL support for various types of applications.
Some of the possibilities of the extensibility features will be shown in this talk, with the creation of server side functions.
Server functions are invoked in SQL and processed in the server. (The mechanics of this is a whole 'nother talk ...)
The client or middleware sees the data that the server sends up. SO other than calling SQL with functions, the client usually needs to know nothing about server functions.
plpython is a server side procedural language based on python.
PostgreSQL supports many types of server side procedural languages. For example,
plpgsql is comparable to many other kinds of "SPL" type languages in other databases.Trigger functions, (which we will not be talking about here) can only be written using C, plpgsql or plpython. Trigger functions can also be written in pltclplpython and plperl enable you to write server side functions in python or perl.
This talk is just about a minor trick to process your data. The implementation is almost trivial with the tools that are available in PostgreSQL.
This functionality is also available in C functions. But the point in writing it in plpython is that it is simpler for those who are not C hackers. (C functions also is a whole 'nother talk.)
Slide four
eg: counters, running sums, etc.
You can create aggregates in PostgreSQL but they have to be written in C and don't usually produce intermediate results except via group by clauses.
It was pointed out that it is possible to write user defined aggregates in other languages. I will investigate further and this will be another item in another column.PostgreSQL also has many built in aggregates.
I am also making a distinction between running aggregates and "ordinary" aggregates like the built-in aggregates already available in PostgreSQL like avg, stddev, sum, count, etc.
The process of an aggregate is:
Different aggregates have variations on these steps. In the case of running aggregates, the one row processing returns an intermediate value and the final return and end are not different than the one row processing.
Yes, this could be done in many different ways, but the point is to demonstrate the technique.
The function simple_counter() is executed once for each row qualified by the selection.
[SLIDE EXAMPLE]
In the case of plpython functions there are three predefined dictionaries in which you can store named data.
Memory leaks on the connection are caused by saving lots of data at once.
plpy.error("msg") -- aborts transactionI use plpy.notice() for debugging plpython functions.
plpy.fatal("msg") -- messy
plpy.debug("msg") -- like notice but tuneable
plpy.notice("msg") -- notice to logfile and stderr: good for immediate debugging
Let us look at HOW and then WHY.
[SLIDE CODE]
This is HOW is fails.
[SLIDE CODE]
How this fails can get even weirder when you use the function in several of interleaved queries or call it multiple times within one statement.
The scope of the SD dictionary is the connection not really the statement as its name implies.
The scope is also limited to the function. Different functions have separate dictionaries so you don't have to worry about dictionary names clashing between functions.
But you do need to worry about dictionaries of the same function clashing. Calling the same function on the same connection will use the same static dictionary.
The value of SD["currval"] remains in existence from the first assignment through the closing of the connection.
Now we need to know HOW to fix this and WHEN to fix this.
Lets start with seeing how to change the function for initialization.
HOW: One part of the problem is that we need to initialize the SD value.
So first let's change the function to allow initialization of the current value.
We choose to init it by passing in an initialization flag. If the flag is 1 the function initializes the value. Otherwise it behaves as it did before.
[SLIDE CODE]
The next problem is WHEN to initialize the current value. In general you would like it initialized before each SQL statement.
But it cannot be done anywhere within the statement or it would be executed for every row selected.
call the function with the init flag in the statement before using the function.[SLIDE CODE]
Note, wrapping these two statements in yet another function won't work, why? This solution does not work for multiple invocations of the function within one statement. Why?
Connection Pool Implications:
You must send statements together on the same connection.
Also Remember that:
you don't know which is the first row, or the last row
There may be a change we can make to plpython to create a properly scoped dictionary for this task. If the scope of the dictionary were the duration of the function instance, there would be no need for this hokey initialization. No promises, but I'll bother people for the details and see what can be done.
All this is possible with C functions. But also with plpython.
I will show one more here.
SET UP BUS EXAMPLE:
For example
you may want to have a function return the previous row's value
This would be done by:
We know WHERE the buses are stopping and WHEN
We want to know the ACTUAL time differences between buses at each stop.
This means two things. First that we need to know:
For each bus stopping at a stop, when the previous bus (if any) was there. Group the results by Stop[PICTURE]
These are the results we want to see. Disregard the yard rows. bad data. oops.
[SLIDE TABLE]
Note that bus #1 arrived @ ashby @ 10:59. The next bus #3 arrived @ ashby @ 11:14 There was 15:00 minutes between those buses. The next bus #5 arrived @ ashby @ 11:18 3 minutes later. Bus #7 then arrived @ ashby @ 11:43, 24 minutes later.
What we now know is that the driver on #5 was going a little fast and the driver of #7 was going a little slow.
This is what we want to know.
Also note that we handle the first arrivals at a stop in a reasonable manner.
So if we already have a function which returns the previous time, what would the query look like to produce this data.
[SLIDE CODE]
Key Points
select prevtime(1,NULL,NULL);
select bus, stop, atime, ptime, atime - ptime as headway
from
(select bus, stop, atime, prevtime(0, stop, atime) as ptime
from (select bus, stop, atime
from buslog where route = 'BART'
order by sord, atime, bus, stop) foo) bar;
Pass in an initialization flag.
Pass in the stop identifier.
Pass in the arrival time.
Return the previous arrival time.
create or replace function prevtime(integer,text, timestamptz)
returns timestamp as
'
if args[0] == 1: | Initialization
SD["curr"] = None | initialization
SD["prev"] = None | initialization
SD["currstop"] = None | initialization
return "12/31/1959" | return my birthday
else: | Runtime
if SD["currstop"] == None: | First Stop?
SD["currstop"] = args[1] | save stop
SD["curr"] = args[2] | save current time
return None | but return null
if args[1] != SD["currstop"]: | New Stop?
SD["currstop"] = args[1] | save stop
SD["curr"] = args[2] | save current time
return None | but return null
| Else same stop, previous time exists
SD["prev"] = SD["curr"] | set previous to old current time
SD["curr"] = args[2] | save new current time
return SD["prev"] | return old current time (now prev)
' language 'plpython';
The cases for First Stop and New Stop can be combined.
They are broken out for clarity.
Do not forget to read about the latest changes in plpython. See General Bits Issue #32