For one of my programs, I have to filter and delete some words from a long text from a long text. I need to remove pronouns, conjugated verbs, determiners and some other words. some other words.
In the previous tutorial, I presented and compared several tree-like data structures to speed up the sorting of these words.
In order to complete this comparison, I wanted to test a last structure by taking into account the last results
In my opinion, the idea of putting several letters in a single branch as I did with the WordTree2g and WordTree3g trees is the right one. The bad performance of these trees comes from the distribution test itself. Indeed to select the right letters to put in each branch, I proceeded to a test with the elem
function which slows down a lot the program.
The solution would be to replace this function with a simpler test that is less time consuming. For example a test of equality ==
or of order <
, >
, <=
or >=
. But if we want to do this, we will have to perform the distribution tests according to the alphabetical order of the letters. The idea of using a specific letter order will therefore be abandoned. On the other hand, the grouping of letters will be specific and adapted to the words to be sorted.
By taking again the statistics of the letters, we will have:
A 77
B 4
C 34
D 27
E 133
F 19
G 9
H 4
I 85
J 5
L 46
M 18
N 59
O 61
P 21
Q 21
R 49
S 103
T 90
U 102
V 21
X 9
Y 4
Z 8
For a total of 1009 letters.
For each version, we will place the cuts in order to minimize the difference in the number of elements in each branch.
If we cut at the letter M (included), we will have :
Branch |
Nb of elems |
---|---|
A-M |
461 |
N-Z |
548 |
The maximum deviation with a perfect distribution of the elements is
If we cut at the letter N (included), we will have :
Branch |
Nb of elems |
---|---|
A-M |
520 |
N-Z |
489 |
The maximum deviation with a perfect distribution of the elements is
The gap is therefore smaller with a cut at the letter N (included).
And we will apply this method to all other trees.
Several tests will be performed with a different number of branches in order to see the influence of this number on the performance of the tree.
We will have the following implementation:
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2) | a <= 'l' = go as t1
| otherwise = go as t2
go _ _ = False
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3) | a <= 'h' = go as t1
| a <= 'r' = go as t2
| otherwise = go as t3
go _ _ = False
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4) | a <= 'e' = go as t1
| a <= 'n' = go as t2
| a <= 's' = go as t3
| otherwise = go as t4
go _ _ = False
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5) | a <= 'd' = go as t1
| a <= 'i' = go as t2
| a <= 'q' = go as t3
| a <= 't' = go as t4
| otherwise = go as t5
go _ _ = False
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6) | a <= 'd' = go as t1
| a <= 'h' = go as t2
| a <= 'm' = go as t3
| a <= 'r' = go as t4
| a <= 't' = go as t5
| otherwise = go as t6
go _ _ = False
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7) | a <= 'd' = go as t1
| a <= 'g' = go as t2
| a <= 'm' = go as t3
| a <= 'q' = go as t4
| a <= 's' = go as t5
| a == 't' = go as t6
| otherwise = go as t7
go _ _ = False
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8) | a <= 'd' = go as t1
| a <= 'f' = go as t2
| a <= 'l' = go as t3
| a <= 'o' = go as t4
| a <= 'r' = go as t5
| a == 's' = go as t6
| a == 't' = go as t7
| otherwise = go as t8
go _ _ = False
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9) | a <= 'c' = go as t1
| a <= 'e' = go as t2
| a <= 'i' = go as t3
| a <= 'm' = go as t4
| a <= 'o' = go as t5
| a <= 'r' = go as t6
| a == 's' = go as t7
| a == 't' = go as t8
| otherwise = go as t9
go _ _ = False
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10) | a <= 'c' = go as t1
| a <= 'e' = go as t2
| a <= 'i' = go as t3
| a <= 'm' = go as t4
| a == 'n' = go as t5
| a <= 'q' = go as t6
| a == 'r' = go as t7
| a == 's' = go as t8
| a == 't' = go as t9
| otherwise = go as t10
go _ _ = False
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11) | a <= 'c' = go as t1
| a <= 'e' = go as t2
| a <= 'h' = go as t3
| a <= 'j' = go as t4
| a <= 'm' = go as t5
| a == 'n' = go as t6
| a <= 'q' = go as t7
| a == 'r' = go as t8
| a == 's' = go as t9
| a == 't' = go as t10
| otherwise = go as t11
go _ _ = False
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12) | a <= 'c' = go as t1
| a <= 'e' = go as t2
| a <= 'h' = go as t3
| a <= 'j' = go as t4
| a <= 'm' = go as t5
| a == 'n' = go as t6
| a <= 'q' = go as t7
| a == 'r' = go as t8
| a == 's' = go as t9
| a == 't' = go as t10
| a == 'u' = go as t11
| otherwise = go as t12
go _ _ = False
The results are presented in the following reports and summary table:
List |
Set |
WordTree 3 |
WordTree 42 |
WordTree 43 |
WordTree 44 |
WordTree 45 |
WordTree 46 |
WordTree 47 |
WordTree 48 |
WordTree 49 |
WordTree 410 |
WordTree 411 |
WordTree 412 |
|
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sec |
100,11463 |
9,08085 |
6,80515 |
7,25959 |
5,98261 |
5,84642 |
5,70404 |
5,69355 |
5,77493 |
5,95812 |
5,85358 |
5,90960 |
5,97161 |
6,05108 |
xfaster/List |
/ |
11,0248 |
14,7116 |
13,7907 |
16,7343 |
17,1241 |
17,5515 |
17,5839 |
17,3361 |
16,8030 |
17,1031 |
16,9410 |
16,7651 |
16,5449 |
%lower/List |
/ |
-90,93 % |
-93,20 % |
-92,75 % |
-94,02 % |
-94,16 % |
-94,30 % |
-94,31 % |
-94,23 % |
-94,05 % |
-94,15 % |
-94,10 % |
-94,04 % |
-93,96 % |
xfaster/Set |
/ |
1,00 |
1,3344 |
1,25 |
1,52 |
1,55 |
1,59 |
1,59 |
1,57 |
1,52 |
1,55 |
1,54 |
1,52 |
1,50 |
%lower/Set |
/ |
/ |
-10,00 % |
-20,06 % |
-34,12 % |
-35,62 % |
-37,19 % |
-37,30 % |
-36,41 % |
-34,39 % |
-35,54 % |
-34,92 % |
-34,24 % |
-33,36 % |
xfaster/WordTree3 |
/ |
/ |
1,00 |
0,94 |
1,14 |
1,16 |
1,19 |
1,20 |
1,18 |
1,14 |
1,16 |
1,15 |
1,14 |
1,12 |
%lower/WordTree3 |
/ |
/ |
/ |
6,68 % |
-12,09 % |
-14,09 % |
-16,18 % |
-16,33 % |
-15,14 % |
-12,45 % |
-13,98 % |
-13,16 % |
-12,25 % |
-11,08 % |
It can be seen that the processing speed is the lowest with 5 or 6 branches in the tree. The execution speed is then:
About 17 times faster than processing with a list.
1.59 times faster than the Set.
1.17/1.19 times faster than the WordTree 3 structure.
This structure is therefore the most optimized one I could find for word filtering.
I want to remind that this structure, the number of branches and the distribution of letters in these branches is only valid for the word list to be filtered. A different list, in another language, will most certainly give different results and will require readjustments to keep the performance optimal.
And remember also that for most purposes the Set
structure provided by the container
module of Haskell will fit perfectly, will be more general than the structure presented here and will save development time.