Home » RDBMS Server » Performance Tuning » Updating text string to mask address characters (Oracle 11)
Updating text string to mask address characters [message #632771] Thu, 05 February 2015 09:33 Go to next message
brown_zzz
Messages: 39
Registered: August 2012
Location: United Kingdom
Member

I'm trying to do an update on a column in 100 millions rows in a table to mask the real addresses by replacing an x for every letter except the first in every word in a string. Hence "22 The Old Kent Road" becomes "22 Txx Oxx Kxxx Rxxx".

I've tried both statements below and they're both very slow - no significant difference in speed. Does anyone know a faster way of doing it please?

To test:

SELECT regexp_replace(INITCAP(LOWER('22 the old kent road')),'[a-z]','x') from dual;
SELECT regexp_replace(INITCAP(LOWER('22 the old kent road')),':lower:','x') from dual;

To implement:

UPDATE my_table set my_column_string = regexp_replace(INITCAP(LOWER(my_column_string)),'[a-z]','x');
UPDATE my_table set my_column_string = regexp_replace(INITCAP(LOWER(my_column_string)),':lower:','x');
Re: Updating text string to mask address characters [message #632777 is a reply to message #632771] Thu, 05 February 2015 10:01 Go to previous messageGo to next message
Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web
Senior Member
For any performance related issues, please post the required details as mentioned in http://www.orafaq.com/forum/t/84315/

[Updated on: Thu, 05 February 2015 10:03]

Report message to a moderator

Re: Updating text string to mask address characters [message #632779 is a reply to message #632777] Thu, 05 February 2015 10:13 Go to previous messageGo to next message
BlackSwan
Messages: 26766
Registered: January 2009
Location: SoCal
Senior Member
>I've tried both statements below and they're both very slow
Quantify exactly how much slower
UPDATE my_table set my_column_string = regexp_replace(INITCAP(LOWER(my_column_string)),':lower:','x');
is from
SELECT COUNT(MY_COLUMN_STRING) FROM MY_TABLE;

You can't access 100 million rows instantaneously.
Re: Updating text string to mask address characters [message #632780 is a reply to message #632779] Thu, 05 February 2015 10:22 Go to previous messageGo to next message
BlackSwan
Messages: 26766
Registered: January 2009
Location: SoCal
Senior Member
What is elapsed time difference between 2 UPDATE statements below?
UPDATE my_table set my_column_string = NULL;
ROLLBACK;
UPDATE my_table set my_column_string = regexp_replace(INITCAP(LOWER(my_column_string)),':lower:','x');
Re: Updating text string to mask address characters [message #632802 is a reply to message #632780] Thu, 05 February 2015 18:41 Go to previous messageGo to next message
rleishman
Messages: 3728
Registered: October 2005
Location: Melbourne, Australia
Senior Member
The performance of updating every row in a large table has very little to do with the functions used in the SELECT list. It has far more to do with the amount of UNDO (rollback) IO and REDO (recovery) IO you are generating.

A neat way to minimise these overheads is to create another table that contains the values you want (CREATE TABLE new AS SELECT func(...), other columns FROM old). Then when you are done, drop (or rename) the old table and rename the new table in its place. Also remember to rebuild indexes.

Ross Leishman
Re: Updating text string to mask address characters [message #632824 is a reply to message #632802] Fri, 06 February 2015 02:43 Go to previous messageGo to next message
cookiemonster
Messages: 13917
Registered: September 2008
Location: Rainy Manchester
Senior Member
And re-add any constraints and triggers
Re: Updating text string to mask address characters [message #632836 is a reply to message #632824] Fri, 06 February 2015 03:15 Go to previous messageGo to next message
FDAVIDOV
Messages: 20
Registered: December 2014
Junior Member
Brown_zzz,

I don't know if you solved your problem, but anyway here is my 5 cents contribution.

The fact that you are able to "PHRASE" the replacement in a very ellegant way (using regular exp.) does not change the fact that the PROCESSOR needs to handle each character one by one. When you use regexp_replace you don't really have any control on HOW the analysis of the string takes place, nor how and when the replacement is done.

I would suggest that you try implementing a small piece of code that scans the string character by character and (sort of State-machine):

- Define a variable that would display the STATE (e.g. 'IN_WORD', 'IN_SEPARATOR', 'IN_NUMBER' - I use here strings to explain the idea, and you may wish to use a number instead of strings to increase performace),
- Define a variable (TARGET VARIABLE) to receive the new "value" after replacing character with "X",
- The first LETTER you encounter you copy as-is to a target variable and set the state to 'IN_WORD',
- While in state 'IN_WORD', any character (letter, symbol, digit) you find till the next separator (blank, comma, etc.) you append an "X" to the target variable,
- As soon as you find a separator (" ", ",", etc.) you change the state to 'IN_SEPARATOR' and append as-is the separator to the target string,
- When you are in state 'IN_SEPARATOR' and find a digit, you change the state to 'IN_NUMBER' and append as-is any additional character to the target variable till the next separator (or end of string).


I'm not sure if this is the very best and fastest algorithm, but I'm quite sure you will see a significant boost in performance.

As for the suggestion of copying the data to a new table... I'm not sure if this will help (100 million records to copy is not something small).

Good luck,

Fernando.
Re: Updating text string to mask address characters [message #632840 is a reply to message #632836] Fri, 06 February 2015 03:32 Go to previous messageGo to next message
Michel Cadot
Messages: 68624
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator

Quote:
I would suggest that you try implementing a small piece of code that scans the string character by character


And you think than a piece of (complex) PL/SQL code will be faster than a built-in function?

Quote:
As for the suggestion of copying the data to a new table... I'm not sure if this will help (100 million records to copy is not something small).


It is surely far faster than updating the same number of lines.

Re: Updating text string to mask address characters [message #632841 is a reply to message #632840] Fri, 06 February 2015 03:48 Go to previous messageGo to next message
FDAVIDOV
Messages: 20
Registered: December 2014
Junior Member
Michel,

No doubt, a built-in function that DOES THE SAME would be faster than a (NO SO COMPLEX) piece of PL/SQL.

My point is that with regexp_replace you may find youself scanning the whole string several times instead of once (of course, if you know the exact details of the internal algorithm implemented within this function, I might change my opinion).

As for copying the whole table, once again, I'm not so sure (if may depend on, for instance, how many and complext indices the table has). Also, I don't quite know how the internals of Oracle handle updates, but if I would need to guess, it would smartly exploit the fact that both initial and final strings are of the same size (if not, it would be a little bit disapointing...).

Developing the small functionality I mentioned is something around 1/2 how or so, and hence I think it is worth trying.

Wouldn't you be happy to see some EVIDENCES proving you RIGHT or WRONG?

Of course, Brown_zzz my think differently...


Re: Updating text string to mask address characters [message #632845 is a reply to message #632841] Fri, 06 February 2015 04:00 Go to previous messageGo to next message
cookiemonster
Messages: 13917
Registered: September 2008
Location: Rainy Manchester
Senior Member
FDAVIDOV wrote on Fri, 06 February 2015 09:48
Michel,

No doubt, a built-in function that DOES THE SAME would be faster than a (NO SO COMPLEX) piece of PL/SQL.

My point is that with regexp_replace you may find youself scanning the whole string several times instead of once (of course, if you know the exact details of the internal algorithm implemented within this function, I might change my opinion).

So you're assuming that oracle can't write a competent regexp?
FDAVIDOV wrote on Fri, 06 February 2015 09:48

As for copying the whole table, once again, I'm not so sure (if may depend on, for instance, how many and complext indices the table has).
Also, I don't quite know how the internals of Oracle handle updates, but if I would need to guess, it would smartly exploit the fact that both initial and final strings are of the same size (if not, it would be a little bit disapointing...).

Well if you have a lot of indexes then it's possible the time spent recreating them outweighs the time saved by doing CTAS, but absent that it's a known fact among people who do these things that CTAS is a lot faster than update, try it if you don't believe us.

FDAVIDOV wrote on Fri, 06 February 2015 09:48

Developing the small functionality I mentioned is something around 1/2 how or so, and hence I think it is worth trying.

Wouldn't you be happy to see some EVIDENCES proving you RIGHT or WRONG?


Why don't you knock up an example then and see how it compares?
Re: Updating text string to mask address characters [message #632846 is a reply to message #632841] Fri, 06 February 2015 04:03 Go to previous messageGo to next message
FDAVIDOV
Messages: 20
Registered: December 2014
Junior Member
Michel,

One more thing we somethines tend to forget:

As fas as I recall, the Oracle engine is written in "C" language, right? (in fact, it doesn't really matter if it is "C", "C+", "C++", or any other langauge; the way ALL programming languages TRANSLATE into machine code is more or less the same for the current available processors technologies).

Now, think how would your "C" code would look if you would need to implement this functionality...

All the best,
Fernando.

[Updated on: Fri, 06 February 2015 04:04]

Report message to a moderator

Re: Updating text string to mask address characters [message #632847 is a reply to message #632845] Fri, 06 February 2015 04:12 Go to previous messageGo to next message
FDAVIDOV
Messages: 20
Registered: December 2014
Junior Member
Michel,

If I had a BIG table on which I could excersice this, I would be more than happy to test it and share with you the results (even if they prove I'm completely wrong).

Let me know if you DO have such a bog table to play with, and I'll try to send you a sample code realizing this idea.

As for trusting Oracle programmers, of course I do. What I'm saying is that the function in question, being extremely powerful and handy, might not be the best choice for this particular problem.

Copying table: Agree (missing information). Still, it would be quite interesting making some tests. Wouldn't you agree?

Fernando.
Re: Updating text string to mask address characters [message #632849 is a reply to message #632847] Fri, 06 February 2015 04:17 Go to previous messageGo to next message
cookiemonster
Messages: 13917
Registered: September 2008
Location: Rainy Manchester
Senior Member
You can create a big table in seconds:
CREATE TABLE big_table AS (SELECT ROWNUM||' the old kent road' AS address FROM dual CONNECT BY LEVEL < 1000000);


Feel free to tweak as you wish.
Re: Updating text string to mask address characters [message #632851 is a reply to message #632849] Fri, 06 February 2015 04:27 Go to previous messageGo to next message
FDAVIDOV
Messages: 20
Registered: December 2014
Junior Member
Thanks.

Crating a big table is not the issue, but having an environment within which I can play is something else (many developing environments at work, but not suitable for "games" of this sort as none has enough space to create big enough tables, and people working on all of them all the time, affecting the results of performance tests).

I'll try to find the way and, if not, will send you the code for string replacemement in the event you want to run the tests at your side.

Of course, both methods will need to be executed and the results compared.

All the best,

Fernando.
Re: Updating text string to mask address characters [message #632917 is a reply to message #632851] Fri, 06 February 2015 21:27 Go to previous messageGo to next message
rleishman
Messages: 3728
Registered: October 2005
Location: Melbourne, Australia
Senior Member
Baseline: Generate 1M rows with varying mixed case content

WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
SELECT MAX(INITCAP(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')))
FROM n n1
CROSS JOIN n n2


Result: 2.66 seconds

Progression: REGEXP_REPLACE with character class
WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
SELECT MAX(REGEXP_REPLACE(INITCAP(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')), '[[:lower:]]', 'x'))
FROM n n1
CROSS JOIN n n2


Result: 15.65 seconds (cost = 13sec / 1M)

Progression: REGEXP_REPLACE with Character Range
WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
SELECT MAX(REGEXP_REPLACE(INITCAP(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')), '[[:lower:]]', 'x'))
FROM n n1
CROSS JOIN n n2


Result: 15.37 seconds (cost = 13sec / 1M)

Progression: TRANSLATE()
WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
SELECT MAX(
         TRANSLATE(
             INITCAP(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp'))
         ,   'abcdefghijklmnopqrstuvwxyz', 'xxxxxxxxxxxxxxxxxxxxxxxxxx'))
FROM n n1
CROSS JOIN n n2


Result: 20.56 seconds (cost = 18sec / 1M)

Conclusion: Over the course of 300M rows, there is 3900 seconds spent on REGEXP_REPLACE. That's about an hour. (I ran my tests on Oracle's APEX cloud service; your mileage might differ, but you could run the queries above locally to get some benchmarks).

If that's going to make a critical difference to your query, then by all means go looking for a better alternative to REGEXP_REPLACE (not that I think you'll find one)

However, I very much doubt you would have posted this here if it had finished in a couple of hours. If I'm right and it's taking many, many hours, then maybe you should go back and check out that CTAS solution we discussed above.

Ross Leishman

Re: Updating text string to mask address characters [message #632922 is a reply to message #632771] Sat, 07 February 2015 02:15 Go to previous messageGo to next message
John Watson
Messages: 8922
Registered: January 2010
Location: Global Village
Senior Member
May I suggest a radical solution? Use the data redaction facility, available from 11.2.0.4.
It has many advantages over the technique you are using now. First, it is declarative: set up the policy, and everyone gets it according to session context. Second, the redaction policies can of course be much more sophisticated than the masking demo'ed so far. Third, it occurs (I think) at column projection time, so no performance hit. Lastly, you don't actually update anything, so the underlying data (incuding performance characteristics) are preserved.
Yes, you have to buy the Advanced Security Option - but you may find it easy to motivate the purchase. You get TDE and network encrption as well.
Re: Updating text string to mask address characters [message #632947 is a reply to message #632917] Sun, 08 February 2015 00:32 Go to previous messageGo to next message
FDAVIDOV
Messages: 20
Registered: December 2014
Junior Member
Hello guys,

I'm addressing here two entries: the one from Ross Leishman and the one from John Watson.

Will start from the second (easier and shorter):
It would be reasonable to assume that, if the only thing Brown_zzz needs is to mask information upon display/select, he/she could just do that (e.g. select REGEXP_REPLACE(...)) and wouldn't post the question to begin with. Hence, it appears natural to think that he/she has all the addresses in the DB already and needs to mask them.

Now to Ross:

I made a small experiment which is generating a date using to_char(SYSDATE + i ,'DAY DD MONTH YYYY') [here, "i" is the index in a loop from 1 to nnn] getting "MONDAY 09 FEBRUARY 2015". Then, used this string as the input for masking using two methods: (1) a simple function as I suggested in a previous entry and (2) the REGEXP_REPLACE function.

The results are as follows (looping 1,000,000 times in both cases; the "Cost" was manually calculated and added to the printing of the code - see below):

Using my simple function:
Starting loop at: 08-02-2015 07:54:15.287399000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 07:54:37.772596000   -----> Cost: 22.485197 secs.



Using regexp_replace:
Starting loop at: 08-02-2015 07:55:56.270955000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 07:56:20.162582000   -----> Cost: 23.891627 secs.

Conclussions:
1) For THIS PARTICULAR NEED(!), it would appear that using a custom function gives better results than using built-in functions (though a different of ~1.5 secs is too small to make a final assertion),
2) Update the existing table VS. create a new table needs to be further examined (whoever can do it, should run both scenarios and, if kind enough, post the results herein).

Some additional comments to Ross' findings:

The test using TRANSLATE is neither "fare" nor applicable to the actual need (i.e. first letter of a name should remain unchanged). Not "fare" because you are asking the system that, for each and every character in the input, scan two strings ("ascd..." and "xxxx...") to find the match; in fact, you don't quite care which letter is in the input and you just need to replace it with "x".

Finally, I'm adding here the code I used for the test explained above.

Fernando.


set serveroutput on

DECLARE

    l_Date_String   VARCHAR2(64) ;
    l_Result_String VARCHAR2(64) ;
    l_Loops         NUMBER       ;
    

    FUNCTION Mask_String(p_In_String  VARCHAR2) RETURN VARCHAR2
    
    AS
    
        l_Length        NUMBER              := LENGTH(p_In_String)  ;
        l_State         NUMBER              := 0                    ; -- 0 = init, 1 = in word, 2 = in number, 3 = in separator
        l_Result        VARCHAR2(128)       := ''                   ;
        l_This_Char     CHAR(1)                                     ;
    Begin
    
        for i in 1..l_Length
        LOOP
            l_This_Char := substr( p_In_String , i, 1 )  ;
            
            CASE l_State
            
                WHEN 0 THEN
                
                    IF      ( l_This_Char >= '0') AND ( l_This_Char <= '9')  THEN
                        l_State     := 2 ; -- in number ;
                    ELSIF   ( l_This_Char =  ' ') OR  ( l_This_Char =  ',')  THEN
                        l_State     := 3 ; -- in separator ;
                    ELSE
                        l_State     := 1 ; -- in word ;
                    END IF ;

                    l_Result    := l_Result || l_This_Char ;
                    
                WHEN 1 THEN
                    IF      ( l_This_Char =  ' ') OR  ( l_This_Char =  ',')  THEN
                        l_State     := 3 ; -- in separator ;
                        l_Result    := l_Result || l_This_Char ;
                    ELSE
                        l_Result    := l_Result || 'x'         ;
                    END IF ;
                    
                WHEN 2 THEN
                    IF      ( l_This_Char =  ' ') OR  ( l_This_Char =  ',')  THEN
                        l_State     := 3 ; -- in separator ;
                    END IF ;

                    l_Result    := l_Result || l_This_Char ;
                    
                WHEN 3 THEN
                    IF      ( l_This_Char >= '0') AND ( l_This_Char <= '9')  THEN
                        l_State     := 2 ; -- in number ;
                    ELSIF   ( l_This_Char =  ' ') OR  ( l_This_Char =  ',')  THEN
                        l_State     := 3 ; -- in separator ;
                    ELSE
                        l_State     := 1 ; -- in word ;
                    END IF ;
                    
                    l_Result    := l_Result || l_This_Char ;

            END CASE ;
            
        End Loop ;
            
        RETURN l_Result ;
        
    End ;
    
BEGIN

    l_Loops := 1000000 ;

    dbms_output.put_line('Starting loop at: ' || to_char(systimestamp,'dd-mm-yyyy hh24:mi:ss.FF') ) ;
    
    FOR i IN 1..l_Loops
    Loop
        
        l_Date_String   := to_char(SYSDATE + i ,'DAY DD MONTH YYYY') ;
        
        --- l_Result_String := Mask_String(l_Date_String) ;
        
        l_Result_String := regexp_replace(INITCAP(LOWER(l_Date_String)),'[a-z]','x') ;

        IF i in (1,l_Loops) THEN
            dbms_output.put_line('Masking of:"' || l_Date_String || '"' || CHR(10) ||
                                 '        is: ' || l_Result_String ) ;
        END IF ;

    End Loop ;
    
    dbms_output.put_line('Ending   loop at: ' || to_char(systimestamp,'dd-mm-yyyy hh24:mi:ss.FF') ) ;


END ;


MOD EDIT: ADDED CODE TAGS - RL

[Updated on: Sun, 08 February 2015 04:51] by Moderator

Report message to a moderator

Re: Updating text string to mask address characters [message #632948 is a reply to message #632947] Sun, 08 February 2015 05:00 Go to previous messageGo to next message
rleishman
Messages: 3728
Registered: October 2005
Location: Melbourne, Australia
Senior Member
A test of rebuild vs UPDATE
SQL> 
SQL> CREATE TABLE bigtest (
  2  pk
  3  ,fk
  4  ,txt
  5  )
  6  TABLESPACE BIG
  7  AS
  8  WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
  9  SELECT
 10  	  CAST(n1.n * 1000000 + n2.n * 1000 + n3.n AS NUMBER(10))
 11  ,	  CAST(MOD(n1.n * 1000000 + n2.n * 1000 + n3.n, 54321) AS NUMBER(10))
 12  ,	  CAST(INITCAP(TO_CHAR(sysdate + MOD(n1.n * 1000000 + n2.n*1000 + n3.n, 10000), 'JSp')) AS VARCHAR2(150))
 13  FROM n n1
 14  CROSS JOIN n n2
 15  CROSS JOIN n n3
 16  WHERE ROWNUM <= 10000000
 17  /

Table created.

Elapsed: 00:01:38.48
SQL> 
SQL> CREATE UNIQUE INDEX bigtest_pk ON bigtest(pk) TABLESPACE BIG;

Index created.

Elapsed: 00:01:03.44
SQL> CREATE INDEX bigtest_fk ON bigtest(fk) TABLESPACE BIG;

Index created.

Elapsed: 00:00:58.91
SQL> 
SQL> 
SQL> CREATE TABLE bigtest1
  2  AS
  3  SELECT pk, fk, CAST(regexp_replace(txt, '[[:lower:]]', 'x') AS VARCHAR2(150)) AS txt
  4  FROM bigtest
  5  /

Table created.

Elapsed: 00:04:04.44
SQL> 
SQL> 
SQL> UPDATE bigtest
  2  SET txt = regexp_replace(txt, '[[:lower:]]', 'x')
  3  /

10000000 rows updated.

Elapsed: 00:17:57.51
SQL> 
SQL> 
SQL> 
SQL> SET TIMING OFF
SQL> SPOOL OFF


Summary:
- Rebuild the table: 4:04
- Build 2 indexes: 2:02
- Total for rebuild 10M row table + 2 indexes: 6:06

- UPDATE 10M rows: 17:57

Can we possibly put this one to bed now and just SUGGEST to the OP that rebuilding the table from scratch is worth a go - not a solution necessarily, just worth a go - without getting too distracted about saving a potential 7% on writing a better function. The reasons - as I discussed earlier - are because of UNDO and REDO IO, which form the bulk of the cost of the UPDATE, and can be largely eliminated by rebuilding the table.

Also John's solution (Data Redaction) bears investigation.

Ross Leishman
Re: Updating text string to mask address characters [message #632952 is a reply to message #632948] Sun, 08 February 2015 11:26 Go to previous messageGo to next message
FDAVIDOV
Messages: 20
Registered: December 2014
Junior Member
Great job Ross!!

I didn't take into account (as you mentioned) the UNDO/REDO overhead which, obviously, favors your suggestion.

Conclussions:

1) I'll mark a big point to Ross,
2) I'll mark a very small point to me (in spite commonly found assumptions, sometimes you CAN write something more efficient than an Oracle internal).

Cheers,
Fernando.

Re: Updating text string to mask address characters [message #632954 is a reply to message #632952] Sun, 08 February 2015 17:44 Go to previous messageGo to next message
rleishman
Messages: 3728
Registered: October 2005
Location: Melbourne, Australia
Senior Member
FDAVIDOV wrote on Mon, 09 February 2015 04:26

2) I'll mark a very small point to me (in spite commonly found assumptions, sometimes you CAN write something more efficient than an Oracle internal).


You're very quick to award yourself a point, even if it is a small one. Let's take a deeper look at your function.

First, I will replicate your results on my own database:

TESTING BESPOKE FUNCTION
Starting loop at: 08-02-2015 21:59:36.701000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 21:59:56.039000000

PL/SQL procedure successfully completed.

Starting loop at: 08-02-2015 22:00:19.593000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:00:38.961000000

PL/SQL procedure successfully completed.

Starting loop at: 08-02-2015 22:00:48.855000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:01:08.403000000

PL/SQL procedure successfully completed.

Summary of results:
1: 19.3
2: 19.4
3: 19.5


TEST 1 : REGEXP_REPLACE [a-z]
Starting loop at: 08-02-2015 22:14:00.874000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:14:21.033000000

PL/SQL procedure successfully completed.

Starting loop at: 08-02-2015 22:14:26.551000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:14:46.529000000

PL/SQL procedure successfully completed.

Starting loop at: 08-02-2015 22:14:52.047000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:15:11.786000000

PL/SQL procedure successfully completed.


Summary of results:
1: 20.2
2: 20.0
3: 19.7


Yep, it definitely looks like the bespoke function is a tiny bit faster than REGEXP_REPLACE.

But what about this line of code in your test case:
l_Result_String := regexp_replace(INITCAP(LOWER(l_Date_String)),'[a-z]','x') ;

That LOWER function seems redundant. It converts the whole string to lower case, but then the INITCAP function does the same thing again except for the initial letters of each word.
Let's try again with the redundant LOWER removed

TEST 2 : REGEXP_REPLACE [a-z] (with removed redundant LOWER)
Starting loop at: 08-02-2015 22:17:01.934000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:17:21.372000000

PL/SQL procedure successfully completed.

Starting loop at: 08-02-2015 22:17:37.826000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:17:57.394000000

PL/SQL procedure successfully completed.

Starting loop at: 08-02-2015 22:18:05.916000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:18:25.064000000

PL/SQL procedure successfully completed.

Summary of results:
1: 19.4
2: 19.6
3: 19.2

Hmmm. These results are almost identical to the bespoke function now. We've lost all of our performance improvements.

Just out of interest (it shouldn't make any difference, right?) let's rerun the PL/SQL for REGEXP_REPLACE with the code for the bespoke function removed from the PL/SQL block. After all, it's not being called in this test, so it's fairer if we run REGEXP_REPLACE with it removed.

TEST 3 : REGEXP_REPLACE [a-z] (with removed redundant bespoke function)
Starting loop at: 08-02-2015 22:23:07.129000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:23:26.327000000

PL/SQL procedure successfully completed.

Starting loop at: 08-02-2015 22:23:46.746000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:24:05.723000000

PL/SQL procedure successfully completed.

Starting loop at: 08-02-2015 22:24:17.060000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:24:36.538000000

PL/SQL procedure successfully completed.

Summary of results:
1: 19.2
2: 19.0
3: 19.5

That was unexpected. Just having that little extra code in the block to parse (and maybe optimise) was costing a couple of tenths of a second. We now have the situation where REGEXP_REPLACE appears faster than the bespoke function - maybe not statistically faster, but it's an interesting result.

But hang on, in my earlier SQL test, it seemed like the Character Class [[:lower:]] in REGEXP_REPLACE was measurably faster than the Character Range [a-z]. Wouldn't it be fairer to pit the bespoke function against the faster built-in function?

Let's replace
l_Result_String := regexp_replace(INITCAP(l_Date_String),'[a-z]','x') ;

With
l_Result_String := regexp_replace(INITCAP(l_Date_String),'[[:lower:]]','x') ;


TEST 4 : REGEXP_REPLACE [[:lower:]]
Starting loop at: 08-02-2015 22:40:23.509000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:40:38.651000000

PL/SQL procedure successfully completed.

Starting loop at: 08-02-2015 22:40:51.610000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:41:06.772000000

PL/SQL procedure successfully completed.

Starting loop at: 08-02-2015 22:41:09.365000000
Masking of:"MONDAY    09 FEBRUARY  2015"
        is: Mxxxxx    09 Fxxxxxxx  2015
Masking of:"MONDAY    05 JANUARY   4753"
        is: Mxxxxx    05 Jxxxxxx   4753
Ending   loop at: 08-02-2015 22:41:25.018000000

PL/SQL procedure successfully completed.

Summary of results:
1: 15.1
2: 15.2
3: 15.6

Right, for the first time we seem to have a statistically significant result. REGEXP_REPLACE with a Character Class appears to be about 25% faster than the bespoke function.

But surely we're missing the whole point here. Didn't the OP present to us a SQL problem? So why are we trying to solve it with a language other than SQL? Surely if this is to have any value, we need to test it in SQL.

TEST CASE 5 : REGEXP_REPLACE VS BESPOKE FUNCTION IN SQL
Let's save the bespoke function to the database so we can use it in SQL
SQL> 
SQL> 
SQL> CREATE OR REPLACE FUNCTION Mask_String(p_In_String  VARCHAR2) RETURN VARCHAR2
  2  
  3  	 AS
  4  
  5  	     l_Length	     NUMBER		 := LENGTH(p_In_String)  ;
  6  	     l_State	     NUMBER		 := 0			 ; -- 0 = init, 1 = in word, 2 = in number, 3 = in separator
  7  	     l_Result	     VARCHAR2(128)	 := ''			 ;
  8  	     l_This_Char     CHAR(1)					 ;
  9  	 Begin
 10  
 11  	     for i in 1..l_Length
 12  	     LOOP
 13  		 l_This_Char := substr( p_In_String , i, 1 )  ;
 14  
 15  		 CASE l_State
 16  
 17  		     WHEN 0 THEN
 18  
 19  			 IF	 ( l_This_Char >= '0') AND ( l_This_Char <= '9')  THEN
 20  			     l_State	 := 2 ; -- in number ;
 21  			 ELSIF	 ( l_This_Char =  ' ') OR  ( l_This_Char =  ',')  THEN
 22  			     l_State	 := 3 ; -- in separator ;
 23  			 ELSE
 24  			     l_State	 := 1 ; -- in word ;
 25  			 END IF ;
 26  
 27  			 l_Result    := l_Result || l_This_Char ;
 28  
 29  		     WHEN 1 THEN
 30  			 IF	 ( l_This_Char =  ' ') OR  ( l_This_Char =  ',')  THEN
 31  			     l_State	 := 3 ; -- in separator ;
 32  			     l_Result	 := l_Result || l_This_Char ;
 33  			 ELSE
 34  			     l_Result	 := l_Result || 'x'	    ;
 35  			 END IF ;
 36  
 37  		     WHEN 2 THEN
 38  			 IF	 ( l_This_Char =  ' ') OR  ( l_This_Char =  ',')  THEN
 39  			     l_State	 := 3 ; -- in separator ;
 40  			 END IF ;
 41  
 42  			 l_Result    := l_Result || l_This_Char ;
 43  
 44  		     WHEN 3 THEN
 45  			 IF	 ( l_This_Char >= '0') AND ( l_This_Char <= '9')  THEN
 46  			     l_State	 := 2 ; -- in number ;
 47  			 ELSIF	 ( l_This_Char =  ' ') OR  ( l_This_Char =  ',')  THEN
 48  			     l_State	 := 3 ; -- in separator ;
 49  			 ELSE
 50  			     l_State	 := 1 ; -- in word ;
 51  			 END IF ;
 52  
 53  			 l_Result    := l_Result || l_This_Char ;
 54  
 55  		 END CASE ;
 56  
 57  	     End Loop ;
 58  
 59  	     RETURN l_Result ;
 60  
 61  	 End ;
 62  /

Function created.


And now lets pit it against REGEXP_REPLACE (Character Class) in SQL, rather than PL/SQL

SQL> WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
  2  SELECT MAX(REGEXP_REPLACE(INITCAP(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')), '[[:lower:]]', 'x'))
  3  FROM n n1
  4  CROSS JOIN n n2
  5  /

MAX(REGEXP_REPLACE(INITCAP(TO_CHAR(SYSDATE+MOD(N1.N*1000+N2.N,10000),'JSP')),'[[:LOWER:]]','X'))    
----------------------------------------------------------------------------------------------------
Txx Mxxxxxx Fxxx Hxxxxxx Sxxxx-Txxxx Txxxxxxx Txxxxxxx                                              

Elapsed: 00:00:21.93



SQL> WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
  2  SELECT MAX(Mask_String(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')))
  3  FROM n n1
  4  CROSS JOIN n n2
  5  /

MAX(MASK_STRING(TO_CHAR(SYSDATE+MOD(N1.N*1000+N2.N,10000),'JSP')))                                  
----------------------------------------------------------------------------------------------------
Txx Mxxxxxx Fxxx Hxxxxxx Sxxxxxxxxxx Txxxxxxx Txxxxxxxxxxx                                          

Elapsed: 00:00:47.75


Okay, so now the Bespoke function seems twice as slow as the built-in equivalent. This is going to be true for most if not all built-in functions in SQL vs interpreted bespoke functions, because Oracle uses a separate engine to execute interpreted PL/SQL. This means for every call to the PL/SQL function, we wear the cost of a context shift from SQL to PL/SQL. As can be seen above, the cost of these context shifts is NOT negligible.

And in case we were thinking NATIVE compilation with bespoke PL/SQL functions was the cure for all of our built-in function woes, I recompiled the bespoke function using PLSQL_CODE_TYPE = NATIVE

SQL> WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
  2  SELECT MAX(Mask_String(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')))
  3  FROM n n1
  4  CROSS JOIN n n2
  5  /

MAX(MASK_STRING(TO_CHAR(SYSDATE+MOD(N1.N*1000+N2.N,10000),'JSP')))                                  
----------------------------------------------------------------------------------------------------
Txx Mxxxxxx Fxxx Hxxxxxx Sxxxxxxxxxx Txxxxxxx Txxxxxxxxxxx                                          

Elapsed: 00:00:37.20


It was definitely faster. But not faster than the built-in function.

My conclusion:
Maybe you can sometimes write something faster than built-in functions, but not in this case.

Ross Leishman
Re: Updating text string to mask address characters [message #632960 is a reply to message #632954] Sun, 08 February 2015 23:42 Go to previous message
FDAVIDOV
Messages: 20
Registered: December 2014
Junior Member
Ross,

Quite interesting indeed....

Few comments:

1) In your TEST 3 (run the block using regexp_replace and after removing the unused function) you got a better result. This is quite interesting because the START and END time are measured just before and after the loop and hence it is not clear why the function would be parsed more than once (if at all).

2) Creating a function within the DB (TEST 5): from the very beginning, the problem presented by Brown_zzz was in relation to a once-in-a-lifetime operation, and hence you wouldn't pollute the DB with a function that would be used only once, right?

3) Since I'm as happy about being right as I am learning something new, I'm gladly moving my little point to you, so the updated scores now are:

- Ross : 1.1 Points
- Fernando : 0.0 Points
- Oracle's Consistency : -10.0 Points

Shall we put this to rest now?
Previous Topic: Oracle Database 11g 11.2.0.4.0 running slow
Next Topic: Alert log file size can impact database performance
Goto Forum:
  


Current Time: Thu Mar 28 04:10:24 CDT 2024