{"id":2646,"date":"2023-05-25T10:11:41","date_gmt":"2023-05-25T15:11:41","guid":{"rendered":"http:\/\/www.ishygddt.xyz\/~blog\/?p=2646"},"modified":"2023-05-26T15:54:19","modified_gmt":"2023-05-26T20:54:19","slug":"python-multiset","status":"publish","type":"post","link":"http:\/\/www.ishygddt.xyz\/~blog\/2023\/05\/python-multiset","title":{"rendered":"Python has no(?) multiset"},"content":{"rendered":"<p>I saw an online resource that claimed \u201cThe <code class=\"language-python\" data-line=\"\">collections.Counter<\/code> class in the Python standard library implements a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Multiset\">multiset<\/a> (or bag) type that allows elements in the set to have more than one occurrence.\u201d<\/p>\n<p>However, (and despite the fact that he is quoting <a href=\"https:\/\/github.com\/python\/cpython\/blob\/v3.11.3\/Lib\/collections\/__init__.py#L534\">comments in the CPython source code<\/a>,) this is <strong>misleading<\/strong>. While the <code class=\"language-python\" data-line=\"\">collections.Counter<\/code> class <em>does<\/em> allow you to have more than 1 occurrence of elements in a set, it has one huge difference with \"real\" multisets: it allows <strong>non-positive quantities<\/strong> for elements!<\/p>\n<p>This leads to two disturbing situations if you just believe the claims of whatever the most SEO'd post to your query was:<\/p>\n<pre><code class=\"language-python\" data-line=\"\">from collections import Counter\n\n# 1. Counter is NOT a multiset because it counts quantity-0 members as &quot;present&quot;\nc = Counter([&#039;cat&#039;, &#039;dog&#039;])\nc.subtract([&#039;cat&#039;])\nassert &#039;cat&#039; not in c, &quot;True: &#039;cat&#039; in c and c[&#039;cat&#039;] == {}&quot;.format(c[&#039;cat&#039;])\n\n# 2. Counter is NOT a multiset because it supports negative-quantity membership\nc = Counter([&#039;cat&#039;, &#039;dog&#039;])\nc.subtract([&#039;cat&#039;])\nc.subtract([&#039;cat&#039;])\nc.subtract([&#039;cat&#039;])\nassert &#039;cat&#039; not in c, &quot;True: c[&#039;cat&#039;] == {}&quot;.format(c[&#039;cat&#039;])\n<\/code><\/pre>\n<p>I think Python does <em>not<\/em> currently contain a multiset type. <a href=\"https:\/\/mail.python.org\/archives\/list\/python-dev@python.org\/thread\/5GT4P5KJLXKVAWT3RFRCVDFYKOBGYTVX\/\">Reportedly<\/a>, the idea was pondered in 2004 around the Python 2.4 release, but as of Python 3.11.3 \/ Summer 2023, <a href=\"https:\/\/docs.python.org\/3\/search.html?q=multiset\">the documentation only contains 2 instances of the word \"multiset\"<\/a>, both of which refer to the Counter class.<\/p>\n<p>To be clear, you <em>could<\/em> simply decide to use a Counter as a multiset. It would have a few downsides:<\/p>\n<ul>\n<li><abbr title=\"Easier to Ask Forgiveness than Permission\u2014preferring semantic exceptions over in-codebase prechecks for \u201cnon-happy\u201d codepaths\">EAFP<\/abbr> is <em>utterly<\/em> broken, because you'll be put in an invalid state <strong>with no raised error<\/strong> when trying to <code class=\"language-python\" data-line=\"\">subtract<\/code> something that's already at 0 quantity.<\/li>\n<li>Since <code class=\"language-python\" data-line=\"\">Counter<\/code> doesn't prune elements whose quantity has dropped to zero, using it \"as\" a multiset could produce memory leaks in certain situations.\n<ul>\n<li>You can work around this by calling <code class=\"language-python\" data-line=\"\">c._keep_positive<\/code> after every mutation which might remove an element.<\/li>\n<\/ul>\n<\/li>\n<li>Unless you implement the above workaround for memory-leakage, the <code class=\"language-python\" data-line=\"\">in<\/code> operator won't work properly; you'll have to use <code class=\"language-python\" data-line=\"\">lambda x: c[x] &gt; 0<\/code> to check <em>membership<\/em>, which is cringe.<\/li>\n<\/ul>\n<p>But if all you need from a multiset is <code class=\"language-python\" data-line=\"\">__contains__<\/code>, <code class=\"language-python\" data-line=\"\">count<\/code>, <code class=\"language-python\" data-line=\"\">append<\/code>, and <code class=\"language-python\" data-line=\"\">remove<\/code>, why not just use a boring old <code class=\"language-python\" data-line=\"\">list<\/code> instead?<br \/>\n(*Rhetorical question; I'll just tell you.)<\/p>\n<ul>\n<li><strong>Comparisons won't work properly<\/strong>, since <code class=\"language-python\" data-line=\"\">list<\/code> is arbitrarily ordered.\n<ul>\n<li>If all elements are from the same <a href=\"https:\/\/en.wikipedia.org\/wiki\/Well-order\">well-ordered set<\/a>, you can work around this by calling <code class=\"language-python\" data-line=\"\">l.sort<\/code> on both operands before comparison!<\/li>\n<\/ul>\n<\/li>\n<li><code class=\"language-python\" data-line=\"\">list<\/code> won't throw a <code class=\"language-python\" data-line=\"\">TypeError<\/code> if you accidentally try to stick something mutable into it. (Obviously, this shouldn't be a concern if your type hygiene is good elsewhere or your code certainly doesn't have typing issues, but it still would have been nice to have.)<\/li>\n<li>The name <code class=\"language-python\" data-line=\"\">append<\/code> is a bit weird when you're semantically <code class=\"language-python\" data-line=\"\">add<\/code>-ing an element.<\/li>\n<li>There's no <code class=\"language-python\" data-line=\"\">subtract<\/code> function to remove multiple elements at the same time.<\/li>\n<li>Memory use won't be optimal when your multisets contain many repeated elements (Python needs to allocate 1 word to store a pointer for each slot in a <code class=\"language-python\" data-line=\"\">list<\/code>.)<\/li>\n<li>Performance won't be optimal (e.g. <code class=\"language-python\" data-line=\"\">count<\/code> is <span class=\"language-mathjax\">$O(k)$<\/span>\u2014that is, not proportional to the number of <em>unique<\/em> elements in the set, but to the <em>total<\/em> number of elements in it\u2014with no optimization available when counting multiple elements at once.)<\/li>\n<\/ul>\n<p>The observant reader will notice that the only \u201cobjective\u201d problem there is comparisons\u2014you <em>really can<\/em>, mathematically, use \u201csorted <code class=\"language-python\" data-line=\"\">list<\/code>s\u201d as a multiset!<\/p>\n<hr \/>\n<p>Alternatively, just step up and make a constrained subclass for <code class=\"language-python\" data-line=\"\">Counter<\/code>:<\/p>\n<pre><code class=\"language-python\" data-line=\"\">from collections import Counter\nfrom collections.abc import Mapping as _Mapping\nfrom itertools import chain as _chain, repeat as _repeat, starmap as _starmap\n\nclass multiset(Counter):\n\tdef _keep_positive(self, \/):\n\t\tnonpositive_items = [(elem, count) for elem, count in self.items() if not count &gt; 0]\n\t\tfor elem, count in nonpositive_items:\n\t\t\tif count == 0:  # https:\/\/en.wikipedia.org\/wiki\/Happy_path\n\t\t\t\tdel self[elem]\n\t\t\t\tcontinue\n\t\t\traise RuntimeError(&quot;invalid quantity in multiset&quot;)\n\t\treturn self\n\tdef update(self, iterable=None, \/, **k):\n\t\tsuper().update(iterable, **k)\n\t\tself._keep_positive()\n\tdef elements(self, \/, _most_common=False):\n\t\t# https:\/\/github.com\/python\/cpython\/blob\/v3.11.3\/Lib\/collections\/__init__.py#L624\n\t\titems = self.items() if not _most_common else self.most_common()\n\t\treturn _chain.from_iterable(_starmap(_repeat, items))\n\tdef __repr__(self, \/):\n\t\t# https:\/\/github.com\/python\/cpython\/blob\/v3.11.3\/Lib\/collections\/__init__.py#L731\n\t\tif not self:\n\t\t\treturn f&#039;{self.__class__.__name__}()&#039;\n\t\ttry:\n\t\t\tl = list(self.elements(_most_common=True))\n\t\texcept TypeError:\n\t\t\tl = list(self.elements())\n\t\treturn f&#039;{self.__class__.__name__}({l!r})&#039;\n\t# This isn&#039;t &quot;a dict&quot;; it HAPPENS to SUBCLASS CPython&#039;s dict for performance reasons.\n\t# We&#039;re not barbarians, so let&#039;s get some set-like methods, eh?\n\tdef __iter__(self, \/):\n\t\tyield from self.elements()\n\tdef __len__(self, \/):\n\t\t&quot;Return the number of elements in multiset *self* (cardinality of *self*).&quot;\n\t\t# https:\/\/docs.python.org\/3\/library\/stdtypes.html#set.__len__\n\t\t# https:\/\/github.com\/python\/cpython\/blob\/v3.11.3\/Lib\/collections\/__init__.py#L604\n\t\treturn self.total()\n\tdef add(self, elem, \/):\n\t\t&quot;Add a single element *elem* to the multiset.&quot;\n\t\tself.update([elem])\n\tdef subtract(self, mapping_or_iterable=None, \/, **k):\n\t\t# arguably a footgun method\n\t\t# if the user wants to self -= other after asserting that self &gt; other, they can do that\n\t\traise NotImplementedError(&quot;method not available for multiset. Did you mean multiset.difference_update?&quot;)\n\tdef difference_update(self, mapping_or_iterable=None, \/, **k):\n\t\t&quot;Update the multiset, removing elements found in others.&quot;\n\t\t# https:\/\/docs.python.org\/3\/library\/stdtypes.html#set.difference_update\n\t\t# https:\/\/github.com\/python\/cpython\/blob\/v3.11.3\/Lib\/collections\/__init__.py#L692\n\t\t# https:\/\/github.com\/python\/cpython\/blob\/v3.11.3\/Lib\/collections\/__init__.py#L926\n\t\tif iterable is not None:\n\t\t\tif isinstance(mapping_or_iterable, _Mapping):\n\t\t\t\titems = mapping_or_iterable.items()\n\t\t\telse:\n\t\t\t\titems = zip(mapping_or_iterable, _repeat(1))\n\t\t\tfor elem, count in mapping_or_iterable:\n\t\t\t\tself[elem] = max(0, self[elem] - count)\n\t\tif k:\n\t\t\tself.difference_update(k)\n\t\telse:\n\t\t\t# No need to run this twice\n\t\t\tself._keep_positive()\n\tdef remove(self, elem, \/):\n\t\t&quot;Remove a single element *elem* from the multiset. Raises `KeyError` if *elem* is not present in the multiset.&quot;\n\t\t# https:\/\/docs.python.org\/3\/library\/stdtypes.html#set.remove\n\t\tif elem not in self:\n\t\t\traise KeyError(f&quot;{self.__class__.__name__}.remove(elem): elem not in multiset.&quot;)\n\t\tself.difference_update([elem])\n\tdef discard(self, elem, \/):\n\t\t&quot;Remove all copies of element *elem* from the multiset if it is present.&quot;\n\t\t# https:\/\/docs.python.org\/3\/library\/stdtypes.html#set.discard\n\t\tdel self[elem]\n\tdef pop(self, \/):\n\t\t&quot;Remove and return a single arbitrary element from the multiset. Raises `KeyError` if the multiset is empty.&quot;\n\t\t# https:\/\/docs.python.org\/3\/library\/stdtypes.html#set.pop\n\t\tif not self:\n\t\t\traise KeyError(&quot;pop from an empty multiset&quot;)\n\t\telem = next(iter(self))\n\t\tself.remove([elem])\n\t\treturn elem\n\nclass frozenmultiset(frozenset):\n\tdef __init__(self, *a, **k):\n\t\traise NotImplementedError(&quot;TODO&quot;)\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I saw an online resource that claimed \u201cThe collections.Counter class in the Python standard library implements a multiset (or bag) type that allows elements in the set to have more than one occurrence.\u201d However, (and despite the fact that he is quoting comments in the CPython source code,) this is misleading. While the collections.Counter class &hellip;<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-2646","post","type-post","status-publish","format-standard","hentry","category-drafts"],"_links":{"self":[{"href":"http:\/\/www.ishygddt.xyz\/~blog\/wp-json\/wp\/v2\/posts\/2646","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.ishygddt.xyz\/~blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.ishygddt.xyz\/~blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.ishygddt.xyz\/~blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.ishygddt.xyz\/~blog\/wp-json\/wp\/v2\/comments?post=2646"}],"version-history":[{"count":29,"href":"http:\/\/www.ishygddt.xyz\/~blog\/wp-json\/wp\/v2\/posts\/2646\/revisions"}],"predecessor-version":[{"id":2697,"href":"http:\/\/www.ishygddt.xyz\/~blog\/wp-json\/wp\/v2\/posts\/2646\/revisions\/2697"}],"wp:attachment":[{"href":"http:\/\/www.ishygddt.xyz\/~blog\/wp-json\/wp\/v2\/media?parent=2646"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.ishygddt.xyz\/~blog\/wp-json\/wp\/v2\/categories?post=2646"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.ishygddt.xyz\/~blog\/wp-json\/wp\/v2\/tags?post=2646"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}