The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Bloom::Scalable - Implementation of the probalistic datastructure - ScalableBloomFilter

VERSION

Version 0.01

SYNOPSIS

  use Bloom::Scalable;
  
  my $scalable_bloom = new Bloom::Scalable();
  open my $fh, "</usr/share/dict/words" or die "couldn't open dict";
  my @words = map { chomp; $_ } <$fh>;

  foreach my $word ( @words[0..100] ) {
    $scalable_bloom->add($word);
  }
  # check the presence of a random word in the Bloom Filter
  my $random_index = int(rand(100));
  $scalable_bloom->contains($words[$random_index]);

DESCRIPTION

Bloom Filters were around since 1970 as a probabilistic datastructures primarily used for their property of configurable false positives but *no* false negatives.

While risking false positives, Bloom filters have a strong space advantage over other data structures for representing sets. A Bloom filter with 1% error and an optimal value of k(number of filters) requires only about 9.6 bits per element — regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. The 1% false-positive rate can be reduced by a factor of ten by adding only about 4.8 bits per element.

Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set. No other constant-space set data structure has this property,

Scalable Bloom Filter entered the stage a bit later, primarily due to the work of Paulo Sergio Almeida, Carlos Baquero, Nuno Preguica(Lancaster University). Refer to their research paper titled - "Scalable Bloom Filter" Their rationale was a Scalable Bloom Filter could be created by chaining together filters with decreasing error probabilities so that that the entire datastructure respects the predefined false positive probability agreement. The factor for decreasing the false positive probability as per their work was determined as between 0.8 - 0.9.

Murmur2 Hashing has been used in this module primarily for its speed. The quality of the hashing has been further fine tuned by using the technique defined in the paper by Kirsch, Adam; Mitzenmacher, Michael (2006), "Less Hashing, Same Performance: Building a Better Bloom Filter"

Persisting the BloomFilter, using the Perl MultiCoreEngine are a couple of bells and whistles added for ease and performance.

LIMITATIONS

Currently the module is not thread safe, its the clients responsibility to streamline incoming adds. This implmentation doesn't support Counting Bloom Filter hence doesn't allow deletes. This was more a choice as adding deletions unnecessarily complicates the code. The solution is to build a fresh BloomFilter and replacing the persistent file with this new file.

Ideally the complete implementation in C with a Perl XS wrapper would be the fastest solution and hence this pure Perl implementation might be slower than the native solution... Saying this, my statment needs to be benchmarked to find out the XS speedup.

AUTHOR

Subhojit Banerjee, <subhojit20 at gmail.com>

BUGS

Please report any bugs or feature requests to bug-bloom-scalable at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bloom-Scalable. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Bloom::Scalable

You can also look for information at:

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

Copyright 2014 Subhojit Banerjee.

This program is free software; you can redistribute it and/or modify it under the terms of the the Artistic License (2.0). You may obtain a copy of the full license at:

http://www.perlfoundation.org/artistic_license_2_0

Any use, modification, and distribution of the Standard or Modified Versions is governed by this Artistic License. By using, modifying or distributing the Package, you accept this license. Do not use, modify, or distribute the Package, if you do not accept this license.

If your Modified Version has been derived from a Modified Version made by someone other than you, you are nevertheless required to ensure that your Modified Version complies with the requirements of this license.

This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder.

This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed.

Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

1 POD Error

The following errors were encountered while parsing the POD:

Around line 129:

Non-ASCII character seen before =encoding in '—'. Assuming UTF-8