diff --git a/NEWS.rst b/NEWS.rst index 762acaa..f1b644c 100644 --- a/NEWS.rst +++ b/NEWS.rst @@ -15,6 +15,10 @@ Note: this version is not yet released! * Regenerated bundled Thrift code using Thrift 0.9.0 with the new-style classes flag (`issue #27 `_). +* The new :py:mod:`happybase.filter` module provides improved support for + specifying scanner filters that should be applied at the region server. See + the tutorial and API docs for more information. + HappyBase 0.5 ------------- diff --git a/doc/api.rst b/doc/api.rst index 2b1e3b7..87c2a94 100644 --- a/doc/api.rst +++ b/doc/api.rst @@ -29,6 +29,11 @@ The HappyBase API is organised as follows: The :py:class:`ConnectionPool` class implements a thread-safe connection pool that allows an application to (re)use multiple connections. +:py:mod:`~happybase.filter`: + The :py:mod:`happybase.filter` module provides various helper routines to + construct filter strings to be used with the `filter` argument to + :py:meth:`Table.scan()`. + Connection ========== @@ -56,4 +61,34 @@ Connection pool .. autoclass:: happybase.NoConnectionsAvailable +Scanner filters +=============== + +.. autofunction:: happybase.filter.escape + +.. autofunction:: happybase.filter.make_filter + + +The following filters are defined by default: + +.. class:: happybase.filter.KeyOnlyFilter +.. class:: happybase.filter.FirstKeyOnlyFilter +.. class:: happybase.filter.PrefixFilter +.. class:: happybase.filter.ColumnPrefixFilter +.. class:: happybase.filter.MultipleColumnPrefixFilter +.. class:: happybase.filter.ColumnCountGetFilter +.. class:: happybase.filter.PageFilter +.. class:: happybase.filter.ColumnPaginationFilter +.. class:: happybase.filter.InclusiveStopFilter +.. class:: happybase.filter.TimeStampsFilter +.. class:: happybase.filter.RowFilter +.. class:: happybase.filter.FamilyFilter +.. class:: happybase.filter.QualifierFilter +.. class:: happybase.filter.QualifierFilter +.. class:: happybase.filter.ValueFilter +.. class:: happybase.filter.DependentColumnFilter +.. class:: happybase.filter.SingleColumnValueFilter +.. class:: happybase.filter.SingleColumnValueExcludeFilter +.. class:: happybase.filter.ColumnRangeFilter + .. vim: set spell spelllang=en: diff --git a/doc/user.rst b/doc/user.rst index 57964aa..f52d277 100644 --- a/doc/user.rst +++ b/doc/user.rst @@ -274,15 +274,18 @@ starting with `abc`:: print key, data The scanner examples above only limit the results by row key using the -`row_start`, `row_stop`, and `row_prefix` arguments, but scanners can also -limit results to certain columns, column families, and timestamps, just like -:py:meth:`Table.row` and :py:meth:`Table.rows`. For advanced users, a filter -string can be passed as the `filter` argument. Additionally, the optional +`row_start`, `row_stop`, and `row_prefix` arguments, but scanners can also limit +results to certain columns, column families, and timestamps, just like +:py:meth:`Table.row` and :py:meth:`Table.rows`. Additionally, the optional `limit` argument defines how much data is at most retrieved, and the `batch_size` argument specifies how big the transferred chunks should be. The :py:meth:`Table.scan` API documentation provides more information on the supported scanner options. +Scanners support more advanced filtering techniques by applying filters at the +region servers. See the section on advanced filtering elsewhere in this tutorial +to learn how to use this feature using HappyBase. + Manipulating data ================= @@ -468,7 +471,6 @@ methods can be used to retrieve or set a counter value directly:: :py:meth:`~Table.counter_dec` instead! - Using the connection pool ========================= @@ -555,6 +557,73 @@ operations. This means that the application still has to handle connection errors. +Advanced scanner filters +======================== + +In addition to the scanner features described earlier, HBase can filter scanner +results by applying additional filters at the region servers (predicate +push-down). To use this advanced feature from HappyBase, you can provide a +filter string describing the server-side filters and pass it as the `filter` +argument to :py:class:`Table.scan()`. Example:: + + scanner = table.scanner( + row_start=b'aaa', + row_start=b'eee', + filter=b'KeyOnlyFilter() AND FirstKeyOnlyFilter()', + ) + n_rows = 0 + for row, data in scanner: + n_rows += 1 + +See the HBase documentation for the supported filters and the supported +parameters. See the HBase Thrift documentation for more information about the +filter string syntax. + +Keep in mind that filter strings should be used in *addition* to other ways to +limit the returned scanner data, e.g. by using `row_start` or `columns`. Not +doing so results in horribly slow full table scans at the server. See the HBase +documentation for more information on properly using scanner filters. + +Dynamic filter strings +---------------------- + +For many use cases a literal filter string like the one in the example above +will suffice, but in some cases you might want to programmatically build filter +strings to pass to the Thrift server. This is where the +:py:mod:`happybase.filter` module comes into play. This module provides various +helper routines to build filter strings. For example, to construct a filter +string for a ``QualifierFilter`` you can use something like this:: + + from happybase.filter import QualifierFilter, LESS_OR_EQUAL + + qual = b‘column1’ + f = QualifierFilter(LESS_OR_EQUAL, qual) + scanner = table.scan(row_prefix=b'...', filter=f) + +Note that HappyBase does not include any filtering logic itself. HappyBase does +not check the validity (names and arguments) of the generated filter string, but +only helps with serialising the filter names and properly escaping the arguments +passed to it. + +TODO: it handles bool, int and str automatically + +Using custom filters +-------------------- + +In case you have implemented a custom filter and loaded it in HBase, you can +easily add support for it in HappyBase:: + + from happybase.filter import make_filter, EQUAL + MyCustomFilter = make_filter('MyCustomFilter') + +You can now use the custom filter exactly like the filters provided by default. +If the filter accepts an integer, a comparison operator and a string, you can +use it as follows:: + + f = MyCustomFilter(1, EQUAL, 'foobar') + scanner = table.scan(row_prefix=b'...', filter=f) + + .. rubric:: Next steps The next step is to try it out for yourself! The :doc:`API documentation ` diff --git a/happybase/filter.py b/happybase/filter.py new file mode 100644 index 0000000..d81e80e --- /dev/null +++ b/happybase/filter.py @@ -0,0 +1,216 @@ +""" +Filter module. + +This module provides helper routines to construct Thrift filter strings. +""" + +# TODO: add support for comparators (regex, substring, and so on) + +from __future__ import unicode_literals as _unicode_literals +from functools import partial as _partial + + +LESS = LT = object() +LESS_OR_EQUAL = LE = object() +EQUAL = EQ = object() +NOT_EQUAL = NE = object() +GREATER_OR_EQUAL = GE = object() +GREATER = GT = object() +NO_OP = object() + +_COMPARISON_OPERATOR_STRINGS = { + LESS: '<', + LESS_OR_EQUAL: '<=', + EQUAL: '=', + NOT_EQUAL: '!=', + GREATER_OR_EQUAL: '>=', + GREATER: '>', + NO_OP: '', +} + + +def escape(s): + """Escape a byte string for use in a filter string. + + :param str host: The byte string to escape + :return: Escaped string + :rtype: str + """ + + if not isinstance(s, bytes): + raise TypeError("Only byte strings can be escaped") + + return s.replace(b"'", b"''") + + +def _format_arg(arg): + if isinstance(arg, bool): + return b'true' if arg else b'false' + + if isinstance(arg, int): + return bytes(arg) + + if arg in _COMPARISON_OPERATOR_STRINGS: + return _COMPARISON_OPERATOR_STRINGS[arg] + + if isinstance(arg, bytes): + # TODO: what to do with already escaped strings? + return "'%s'" % escape(arg) + + raise TypeError( + "Filter arguments must be booleans, integers, comparison " + "operators or byte strings; got %r" % arg) + + +# +# Internal node classes +# + +class _Node(object): + def __and__(self, other): + return AND(self, other) + + def __or__(self, rhs): + return OR(self, rhs) + + +class _FilterNode(_Node): + """Client-side Filter representation. + + This class does not have any filtering logic; it is only used to + build filter strings that the HBase Thrift server can parse and + apply. + """ + def __init__(self, name, *args): + + if isinstance(name, unicode): + name = name.encode('ascii') + + if not isinstance(name, bytes): + raise TypeError("Filter name must be a string") + + self.name = name + self.args = map(_format_arg, args) + + def __str__(self): + return b'%s(%s)' % (self.name, ', '.join(self.args)) + + +class _UnaryOperatorNode(_Node): + def __init__(self, value): + if not isinstance(value, _FilterNode): + raise TypeError( + "'SKIP' and 'WHILE' can only be applied to Filters; " + "got %r" % value) + + self.value = value + + def __str__(self): + return b'%s %s' % (self.operator, self.value) + + +class _SkipNode(_UnaryOperatorNode): + operator = 'SKIP' + + +class _WhileNode(_UnaryOperatorNode): + operator = 'WHILE' + + +class _BooleanOperatorNode(_Node): + def __init__(self, *operands): + self.operands = operands + + def __str__(self): + glue = b' %s ' % self.operator + return glue.join(map(bytes, self.operands)) + + def _extend(self, other): + if isinstance(other, self.__class__): + operands = self.operands + (other,) + return self.__class__(*operands) + else: + return self.__class__(self, other) + + +class _AndNode(_BooleanOperatorNode): + operator = 'AND' + + def __and__(self, other): + return self._extend(other) + + +class _OrNode(_BooleanOperatorNode): + operator = 'OR' + + def __or__(self, other): + return self._extend + + +# +# Public API for constructing nodes +# + +def SKIP(f): + return _SkipNode(f) + + +def WHILE(f): + return _WhileNode(f) + + +def AND(*operands): + return _AndNode(*operands) + + +def OR(*operands): + return _OrNode(*operands) + + +def make_filter(name): + """Define a new filter with the specified name. + + Use this function to specify custom filters that are not included by + default, such as custom filters you wrote yourself and made + available in the HBase server (or newly added filters that are not + yet in HappyBase). + + The callable returned by this function can be used just like the + built-in filters. + + Example:: + + MyCustomFilter = make_filter(b'MyCustomFilter') + f = MyCustomFilter(1, b'foo') + table.scan(..., filter=f) + + :param str name: name of the filter + :return: new filter callable + :rtype: filter callable + """ + return _partial(_FilterNode, name) + + +# +# Built-in filters (taken from the Thrift docs) +# + +KeyOnlyFilter = make_filter('KeyOnlyFilter') +FirstKeyOnlyFilter = make_filter('FirstKeyOnlyFilter') +PrefixFilter = make_filter('PrefixFilter') +ColumnPrefixFilter = make_filter('ColumnPrefixFilter') +MultipleColumnPrefixFilter = make_filter('MultipleColumnPrefixFilter') +ColumnCountGetFilter = make_filter('ColumnCountGetFilter') +PageFilter = make_filter('PageFilter') +ColumnPaginationFilter = make_filter('ColumnPaginationFilter') +InclusiveStopFilter = make_filter('InclusiveStopFilter') +TimeStampsFilter = make_filter('TimeStampsFilter') +RowFilter = make_filter('RowFilter') +FamilyFilter = make_filter('FamilyFilter') +QualifierFilter = make_filter('QualifierFilter') +QualifierFilter = make_filter('QualifierFilter') +ValueFilter = make_filter('ValueFilter') +DependentColumnFilter = make_filter('DependentColumnFilter') +SingleColumnValueFilter = make_filter('SingleColumnValueFilter') +SingleColumnValueExcludeFilter = make_filter('SingleColumnValueExcludeFilter') +ColumnRangeFilter = make_filter('ColumnRangeFilter') diff --git a/happybase/table.py b/happybase/table.py index bb32ce6..77756b1 100644 --- a/happybase/table.py +++ b/happybase/table.py @@ -10,6 +10,7 @@ from .hbase.ttypes import TScan from .util import thrift_type_to_dict, str_increment from .batch import Batch +from .filter import _FilterNode logger = logging.getLogger(__name__) @@ -204,8 +205,8 @@ def cells(self, row, column, versions=None, timestamp=None, return map(make_cell, cells) def scan(self, row_start=None, row_stop=None, row_prefix=None, - columns=None, filter=None, timestamp=None, - include_timestamp=False, batch_size=1000, limit=None): + columns=None, timestamp=None, include_timestamp=False, + batch_size=1000, limit=None, filter=None): """Create a scanner for data in the table. This method returns an iterable that can be used for looping over the @@ -230,9 +231,6 @@ def scan(self, row_start=None, row_stop=None, row_prefix=None, The `columns`, `timestamp` and `include_timestamp` arguments behave exactly the same as for :py:meth:`row`. - The `filter` argument may be a filter string that will be applied at - the server by the region servers. - If `limit` is given, at most `limit` results will be returned. The `batch_size` argument specifies how many results should be @@ -240,6 +238,12 @@ def scan(self, row_start=None, row_stop=None, row_prefix=None, this to a low value (or even 1) if your data is large, since a low batch size results in added round-trips to the server. + The `filter` argument may be a filter string that will be + applied at the server by the region servers. If you need more + than a static filter string literal, use the helpers in the + :py:mod:`happybase.filter` module to construct filter strings + programmatically. + **Compatibility note:** The `filter` argument is only available when using HBase 0.92 (or up). In HBase 0.90 compatibility mode, specifying a `filter` raises an exception. @@ -274,6 +278,14 @@ def scan(self, row_start=None, row_stop=None, row_prefix=None, if row_start is None: row_start = '' + if filter is not None: + if isinstance(filter, _FilterNode): + filter = str(filter) + + if not isinstance(filter, str): + raise TypeError( + "'filter' must be a filter instance or a (byte) string") + if self.connection.compat == '0.90': # The scannerOpenWithScan() Thrift function is not # available, so work around it as much as possible with the diff --git a/setup.cfg b/setup.cfg index ed8626d..89befdf 100644 --- a/setup.cfg +++ b/setup.cfg @@ -3,7 +3,7 @@ stop = 1 verbosity = 2 with-coverage = 1 cover-erase = 1 -cover-package=happybase.connection,happybase.table,happybase.batch,happybase.pool,happybase.util,tests +cover-package=happybase.connection,happybase.table,happybase.batch,happybase.pool,happybase.util,happybase.filter,tests cover-tests = 1 cover-html = 1 cover-html-dir = coverage/ diff --git a/tests/test_filter.py b/tests/test_filter.py new file mode 100644 index 0000000..c3d4edb --- /dev/null +++ b/tests/test_filter.py @@ -0,0 +1,158 @@ +""" +HappyBase filter tests. +""" + +from __future__ import unicode_literals + +from nose.tools import assert_equal, assert_raises + +from happybase.filter import ( + AND, + EQUAL, + escape, + GREATER, + GREATER_OR_EQUAL, + LESS, + LESS_OR_EQUAL, + make_filter, + NOT_EQUAL, + OR, + SKIP, + ValueFilter, + WHILE, +) + + +def test_escape(): + + assert_raises(TypeError, escape, u'foo') + assert_raises(TypeError, escape, 42) + assert_raises(TypeError, escape, None) + + def check(original, expected): + actual = escape(original) + assert_equal(actual, expected) + + test_values = [ + (b'', b''), + (b'foo', b'foo'), + (b'\x03\x02\x01\x00', b'\x03\x02\x01\x00'), + (b"foo'ba''r", b"foo''ba''''r"), + ] + + for original, expected in test_values: + yield check, original, expected + + +def test_filter_serialization(): + + # Comparison operators + f = ValueFilter( + LESS, + LESS_OR_EQUAL, + EQUAL, + NOT_EQUAL, + GREATER_OR_EQUAL, + GREATER, + ) + exp = b"ValueFilter(<, <=, =, !=, >=, >)" + assert_equal(exp, bytes(f)) + + # Booleans + f = ValueFilter(True, False) + exp = b"ValueFilter(true, false)" + assert_equal(exp, bytes(f)) + + # Integers + f = ValueFilter(12, 13, -1, 0) + exp = b"ValueFilter(12, 13, -1, 0)" + assert_equal(exp, bytes(f)) + + # Strings + f = ValueFilter(b'foo', b"foo'bar", b'bar') + exp = b"ValueFilter('foo', 'foo''bar', 'bar')" + assert_equal(exp, bytes(f)) + + # Mixed args + assert_equal( + b"ValueFilter(>=, 'foo', 12, 'bar')", + bytes(ValueFilter(GREATER_OR_EQUAL, b'foo', 12, b'bar')) + ) + + +def test_type_checking(): + assert_raises(TypeError, ValueFilter, u'foo') + assert_raises(TypeError, ValueFilter, 3.14) + assert_raises(TypeError, ValueFilter, object()) + assert_raises(TypeError, ValueFilter, None) + + +def test_custom_filter(): + + MyCustomFilter = make_filter('MyCustomFilter') + + assert_equal( + b"MyCustomFilter(1, =, 'foo''bar')", + bytes(MyCustomFilter(1, EQUAL, b"foo'bar")) + ) + + with assert_raises(TypeError): + f = make_filter(None) + f(1, 2) + + with assert_raises(TypeError): + f = make_filter, (12) + f(1, 2) + + +def test_unary_operators(): + + F = make_filter('F') + + assert_equal( + b'SKIP F()', + bytes(SKIP(F())) + ) + + assert_equal( + b'WHILE F()', + bytes(WHILE(F())) + ) + + +def test_boolean_operators(): + + def check(expected, original): + actual = bytes(original) + assert_equal(actual, expected) + + F = make_filter('F') + + f = b'F(1) AND F(2)' + check(f, AND(F(1), F(2))) + check(f, F(1) & F(2)) + + f = b'F(1) OR F(2)' + check(f, OR(F(1), F(2))) + check(f, F(1) | F(2)) + + check( + b'F(1) AND F(2) AND F(3)', + AND(F(1), F(2), F(3)) + ) + + check( + b'F(1) AND F(2) AND F(3)', + F(1) & F(2) & F(3) + ) + + check( + b'F(1) AND F(2) OR F(3)', + F(1) & F(2) | F(3) + ) + + # FIXME: precedence stuff doesn't work correctly + check( + b'F(1) AND (F(2) OR F(3))', + F(1) & (F(2) | F(3)) + )