bloomfilter Wiki & Documentation Rss Feed http://www.codeplex.com/bloomfilter/Wiki/View.aspx?title=Home bloomfilter Wiki Rss Description
Updated Wiki: Home https://bloomfilter.codeplex.com/wikipage?version=6<div class="wikidoc">A Bloom filter is a data structure optimized for fast, space-efficient set membership tests. Bloom filters have the unusual property of requiring constant time to add an element to the set or test for membership, regardless of the size of the elements or the number of elements already in the set. No other constant-space set data structure has this property. <br /><br />It works by storing a bit vector representing the set S' = {h[i](x) | x in S, i = 1, …, k}, where h[1], …, h[k] : {0, 1}* -> [n lg(1/ε) lg e] are hash functions that map arbitrary bit strings to positions in a bit vector of length n lg(1/ε) lg e. Adding an element x sets k bits to 1, specifically those at positions h[1](x), …, h[k](x). A membership check computes the same hash functions and returns true if all of the resulting positions are 1. <br /><br />Because the set the bit vector represents is a superset of the set of items added, false positives may occur, though false negatives cannot. The desired false positive rate ε can be specified. <br /><br />Bloom filters offer the following advantages:
<ul><li>Space: Approximately n * lg(1/ε) * lg e ≈ 1.44 * n * lg(1/ε) bits, where ε is the false positive rate and n is the number of elements in the set.
<ul><li>Example: There are approximately 170k words in the English language. If we consider that to be our set (therefore n = 1.7E5), and we wish to search a corpus for them with a 1% false positive rate, the filter would require about (1.7E5 * lg(1 / 0.01) * lg e) ≈ 1.63E6 bits ≈ 199 KB. Contrast this with a hashtable, which would require (1.7E5 elements * 32 bits per element) ≈ 664 KB. Explicit string storage would require significantly more. </li></ul></li>
<li>Precision: Arbitrary precision, where increasing precision requires more space (following the above size equation) but not more time.
<ul><li>Example: If we wanted to reduce our false positive rate in the above example from one percent to one permille, the space requirement would go from about 199 KB to about 298 KB. </li></ul></li>
<li>Time: O(k), where k is the number of hash functions. The optimal number of hash functions is ceiling(lg(1/ε)), though the user can supply a different number if desired.
<ul><li>Example: In keeping with our above example, if the accepted false positive rate is 0.001, k = 10. </li></ul></li></ul>
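<br />The sizing rules and the add/check procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not this project's C# code: the names optimal_parameters and BloomFilter are hypothetical, the two base hashes are taken from a SHA-256 digest as a stand-in for GetHashCode() and a secondary hash, and the k positions are derived by plain double hashing, (h1 + i * h2) mod m, which may differ in detail from the scheme the implementation uses.

```python
import hashlib
import math


def optimal_parameters(n, eps):
    """Return (m, k): bit-vector length and hash count for n items at rate eps."""
    # m = n * lg(1/eps) * lg(e) bits, k = ceiling(lg(1/eps)) hash functions.
    m = math.ceil(n * math.log2(1 / eps) * math.log2(math.e))
    k = math.ceil(math.log2(1 / eps))
    return m, k


class BloomFilter:
    """Illustrative Bloom filter for strings (hypothetical, not the project's API)."""

    def __init__(self, n, eps):
        self.m, self.k = optimal_parameters(n, eps)
        self.bits = bytearray((self.m + 7) // 8)  # m bits, rounded up to whole bytes

    def _positions(self, item):
        # Assumption: both base hashes come from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        # Plain double hashing: the i-th position is (h1 + i * h2) mod m.
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Set the k bits selected by the hash functions.
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True only if every one of the k selected bits is set.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


if __name__ == "__main__":
    words = BloomFilter(n=170_000, eps=0.01)
    words.add("bloom")
    print("bloom" in words)  # True: an added item is never reported as missing
```

For n = 170,000 the sketch uses k = 7 positions per item at ε = 0.01 and k = 10 at ε = 0.001. Both add and the membership test touch exactly k bits, however many items are already stored.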
<br />This implementation uses Dillinger & Manolios double hashing to provide all but the first two hash functions. By default the first hash function is the type's GetHashCode() method. This implementation also includes default secondary hash functions for strings (<a href="http://www.burtleburtle.net/bob/hash/doobs.html">Jenkins's "One at a time" method</a>) and integers (<a href="http://burtleburtle.net/bob/hash/integer.html">Wang's method</a>). <br /><br />Bloom filters are due to Burton H. Bloom, as described in the Communications of the ACM in July 1970. The full paper is available <a href="http://portal.acm.org/citation.cfm?doid=362686.362692">here</a>. </div><div class="ClearBoth"></div>JustinRussell Tue, 11 Nov 2014 21:38:58 GMT Updated Wiki: Home 20141111093858P
Updated Wiki: Home http://bloomfilter.codeplex.com/wikipage?version=5 fatcat1111 Mon, 02 May 2011 19:35:38 GMT Updated Wiki: Home 20110502073538P
Updated Wiki: Home http://bloomfilter.codeplex.com/wikipage?version=4 fatcat1111 Wed, 03 Mar 2010 20:24:55 GMT Updated Wiki: Home 20100303082455P
Updated Wiki: Home http://bloomfilter.codeplex.com/Wiki/View.aspx?title=Home&version=3 fatcat1111 Sun, 26 Apr 2009 05:15:37 GMT Updated Wiki: Home 20090426051537A
UPDATED WIKI: Home http://www.codeplex.com/bloomfilter/Wiki/View.aspx?title=Home&version=2<div class="wikidoc">
Here is a summary from <a href="http://en.wikipedia.org/wiki/Bloom_filter" class="externalLink">http://en.wikipedia.org/wiki/Bloom_filter<span class="externalLinkIcon"></span></a>, which has more details:<br /> <br />The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed. The more elements that are added to the set, the higher the probability of false positives.<br /> <br />For example, one might use a Bloom filter to do spell-checking in a space-efficient way. A Bloom filter to which a dictionary of correct words has been added will accept all words in the dictionary and reject almost all words that are not in it, which is good enough in some cases. Depending on the false positive rate, the resulting data structure can require as little as a byte per dictionary word.<br /> <br />A Bloom filter with a 1% false positive rate and an optimal value of k requires only about 9.6 bits per element, regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each additional 4.8 bits per element reduces it by a factor of ten.<br /> <br />Bloom filters also have the unusual property that the time needed to either add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set. No other constant-space set data structure has this property, but the average access time of sparse hash tables can make them faster in practice than some Bloom filters. <br />
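The 9.6-bit and 4.8-bit figures quoted above follow from the standard approximation for the false positive rate. The derivation below is a sketch that assumes ideal, independent hash functions. The symbols m (bits in the filter), n (elements added), k (hash functions) and p (false positive rate) are introduced here for illustration.

```latex
\begin{align*}
p &\approx \left(1 - e^{-kn/m}\right)^{k}
  && \text{false positive rate after $n$ insertions} \\
k_{\mathrm{opt}} &= \frac{m}{n}\ln 2
  \;\Rightarrow\; p = 2^{-k_{\mathrm{opt}}}
  && \text{choosing $k$ to minimize $p$} \\
\frac{m}{n} &= \frac{\log_2(1/p)}{\ln 2} \approx 1.44\,\log_2(1/p)
  && \text{bits per element} \\
p = 0.01 &:\quad \frac{m}{n} \approx 1.44 \times 6.64 \approx 9.6 \\
p \to p/10 &:\quad \Delta\frac{m}{n} = \frac{\log_2 10}{\ln 2} \approx 1.44 \times 3.32 \approx 4.8
\end{align*}
```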
</div>fatcat1111 Mon, 25 Jun 2007 00:06:21 GMT UPDATED WIKI: Home 20070625120621A