Performance comparison of word filtering functions Performance comparison of word filtering functions (Part 2)

Performance comparison of word filtering functions (Part 2)

Introduction

For one of my programs, I have to filter and delete some words from a long text from a long text. I need to remove pronouns, conjugated verbs, determiners and some other words. some other words.

In the previous tutorial, I presented and compared several tree-like data structures to speed up the sorting of these words.

In order to complete this comparison, I wanted to test a last structure by taking into account the last results

Multi-branch custom tree (WordTree4x)

In my opinion, the idea of putting several letters in a single branch as I did with the WordTree2g and WordTree3g trees is the right one. The bad performance of these trees comes from the distribution test itself. Indeed to select the right letters to put in each branch, I proceeded to a test with the elem function which slows down a lot the program.

The solution would be to replace this function with a simpler test that is less time consuming. For example a test of equality == or of order <, >, <= or >=. But if we want to do this, we will have to perform the distribution tests according to the alphabetical order of the letters. The idea of using a specific letter order will therefore be abandoned. On the other hand, the grouping of letters will be specific and adapted to the words to be sorted.

By taking again the statistics of the letters, we will have:

A 77
B 4
C 34
D 27
E 133
F 19
G 9
H 4
I 85
J 5
L 46
M 18
N 59
O 61
P 21
Q 21
R 49
S 103
T 90
U 102
V 21
X 9
Y 4
Z 8

For a total of 1009 letters.

For each version, we will place the cuts in order to minimize the difference in the number of elements in each branch.

If we cut at the letter M (included), we will have :

Branch	Nb of elems
A-M	461
N-Z	548

The maximum deviation with a perfect distribution of the elements is $| 548 - 504.5 | = 43.5$

If we cut at the letter N (included), we will have :

Branch	Nb of elems
A-M	520
N-Z	489

The maximum deviation with a perfect distribution of the elements is $| 489 - 504.5 | = 15.5$

The gap is therefore smaller with a cut at the letter N (included).

And we will apply this method to all other trees.

Implémentations

Several tests will be performed with a different number of branches in order to see the influence of this number on the performance of the tree.

We will have the following implementation:

2-branchs custom tree (WordTree42)

member v0 tree = go v0 tree
    where
        go _  Tip                 = False
        go _  t@(Fork vs Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2) | a <= 'l'  = go as t1
                                    | otherwise = go as t2

        go _ _ = False

3-branchs custom tree (WordTree43)

member v0 tree = go v0 tree
    where
        go _  Tip                     = False
        go _  t@(Fork vs Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3) | a <= 'h'  = go as t1
                                       | a <= 'r'  = go as t2
                                       | otherwise = go as t3

        go _ _ = False

4-branchs custom tree (WordTree44)

member v0 tree = go v0 tree
    where
        go _  Tip                         = False
        go _  t@(Fork vs Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4) | a <= 'e'  = go as t1
                                          | a <= 'n'  = go as t2
                                          | a <= 's'  = go as t3
                                          | otherwise = go as t4

        go _ _ = False

5-branchs custom tree (WordTree45)

member v0 tree = go v0 tree
    where
        go _  Tip                             = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5) | a <= 'd'  = go as t1
                                             | a <= 'i'  = go as t2
                                             | a <= 'q'  = go as t3
                                             | a <= 't'  = go as t4
                                             | otherwise = go as t5

        go _ _ = False

6-branchs custom tree (WordTree46)

member v0 tree = go v0 tree
    where
        go _  Tip                                 = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6) | a <= 'd'  = go as t1
                                                | a <= 'h'  = go as t2
                                                | a <= 'm'  = go as t3
                                                | a <= 'r'  = go as t4
                                                | a <= 't'  = go as t5
                                                | otherwise = go as t6

        go _ _ = False

7-branchs custom tree (WordTree47)

member v0 tree = go v0 tree
    where
        go _  Tip                                     = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7) | a <= 'd'  = go as t1
                                                   | a <= 'g'  = go as t2
                                                   | a <= 'm'  = go as t3
                                                   | a <= 'q'  = go as t4
                                                   | a <= 's'  = go as t5
                                                   | a == 't'  = go as t6
                                                   | otherwise = go as t7

        go _ _ = False

8-branchs custom tree (WordTree48)

member v0 tree = go v0 tree
    where
        go _  Tip                                         = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8) | a <= 'd'  = go as t1
                                                      | a <= 'f'  = go as t2
                                                      | a <= 'l'  = go as t3
                                                      | a <= 'o'  = go as t4
                                                      | a <= 'r'  = go as t5
                                                      | a == 's'  = go as t6
                                                      | a == 't'  = go as t7
                                                      | otherwise = go as t8

        go _ _ = False

9-branchs custom tree (WordTree49)

member v0 tree = go v0 tree
    where
        go _  Tip                                             = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9) | a <= 'c'  = go as t1
                                                         | a <= 'e'  = go as t2
                                                         | a <= 'i'  = go as t3
                                                         | a <= 'm'  = go as t4
                                                         | a <= 'o'  = go as t5
                                                         | a <= 'r'  = go as t6
                                                         | a == 's'  = go as t7
                                                         | a == 't'  = go as t8
                                                         | otherwise = go as t9

        go _ _ = False

10-branchs custom tree (WordTree410)

member v0 tree = go v0 tree
    where
        go _  Tip                                                 = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10) | a <= 'c'  = go as t1
                                                             | a <= 'e'  = go as t2
                                                             | a <= 'i'  = go as t3
                                                             | a <= 'm'  = go as t4
                                                             | a == 'n'  = go as t5
                                                             | a <= 'q'  = go as t6
                                                             | a == 'r'  = go as t7
                                                             | a == 's'  = go as t8
                                                             | a == 't'  = go as t9
                                                             | otherwise = go as t10

        go _ _ = False

11-branchs custom tree (WordTree411)

member v0 tree = go v0 tree
    where
        go _  Tip                                                     = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11) | a <= 'c'  = go as t1
                                                                 | a <= 'e'  = go as t2
                                                                 | a <= 'h'  = go as t3
                                                                 | a <= 'j'  = go as t4
                                                                 | a <= 'm'  = go as t5
                                                                 | a == 'n'  = go as t6
                                                                 | a <= 'q'  = go as t7
                                                                 | a == 'r'  = go as t8
                                                                 | a == 's'  = go as t9
                                                                 | a == 't'  = go as t10
                                                                 | otherwise = go as t11

        go _ _ = False

12-branchs custom tree (WordTree412)

member v0 tree = go v0 tree
    where
        go _  Tip                                                         = False
        go _  t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
        go [] t@(Fork vs _   _   _   _   _   _   _   _   _   _   _   _  ) = v0 `elem` vs

        go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12) | a <= 'c'  = go as t1
                                                                     | a <= 'e'  = go as t2
                                                                     | a <= 'h'  = go as t3
                                                                     | a <= 'j'  = go as t4
                                                                     | a <= 'm'  = go as t5
                                                                     | a == 'n'  = go as t6
                                                                     | a <= 'q'  = go as t7
                                                                     | a == 'r'  = go as t8
                                                                     | a == 's'  = go as t9
                                                                     | a == 't'  = go as t10
                                                                     | a == 'u'  = go as t11
                                                                     | otherwise = go as t12

        go _ _ = False

Results

The results are presented in the following reports and summary table:

	List	Set	WordTree 3	WordTree 42	WordTree 43	WordTree 44	WordTree 45	WordTree 46	WordTree 47	WordTree 48	WordTree 49	WordTree 410	WordTree 411	WordTree 412
sec	100,11463	9,08085	6,80515	7,25959	5,98261	5,84642	5,70404	5,69355	5,77493	5,95812	5,85358	5,90960	5,97161	6,05108
xfaster/List	/	11,0248	14,7116	13,7907	16,7343	17,1241	17,5515	17,5839	17,3361	16,8030	17,1031	16,9410	16,7651	16,5449
%lower/List	/	-90,93 %	-93,20 %	-92,75 %	-94,02 %	-94,16 %	-94,30 %	-94,31 %	-94,23 %	-94,05 %	-94,15 %	-94,10 %	-94,04 %	-93,96 %
xfaster/Set	/	1,00	1,3344	1,25	1,52	1,55	1,59	1,59	1,57	1,52	1,55	1,54	1,52	1,50
%lower/Set	/	/	-10,00 %	-20,06 %	-34,12 %	-35,62 %	-37,19 %	-37,30 %	-36,41 %	-34,39 %	-35,54 %	-34,92 %	-34,24 %	-33,36 %
xfaster/WordTree3	/	/	1,00	0,94	1,14	1,16	1,19	1,20	1,18	1,14	1,16	1,15	1,14	1,12
%lower/WordTree3	/	/	/	6,68 %	-12,09 %	-14,09 %	-16,18 %	-16,33 %	-15,14 %	-12,45 %	-13,98 %	-13,16 %	-12,25 %	-11,08 %

Conlusion

It can be seen that the processing speed is the lowest with 5 or 6 branches in the tree. The execution speed is then:

About 17 times faster than processing with a list.
1.59 times faster than the Set.
1.17/1.19 times faster than the WordTree 3 structure.

This structure is therefore the most optimized one I could find for word filtering.

I want to remind that this structure, the number of branches and the distribution of letters in these branches is only valid for the word list to be filtered. A different list, in another language, will most certainly give different results and will require readjustments to keep the performance optimal.

And remember also that for most purposes the Set structure provided by the container module of Haskell will fit perfectly, will be more general than the structure presented here and will save development time.

Benchmarks