Two identical regex literals don’t equal each other?
From surprise at the REPL to learning Java implementation decisions.
(= #"." #".")
=> false
Wat?
Per the Equality Guide:
Clojure regex’s, e.g. #"a.*bc", are implemented using Java java.util.regex.Pattern objects, and Java’s equals on two Pattern objects returns (identical? re1 re2), even though they are documented as immutable objects. Thus (= #"abc" #"abc") returns false, and = only returns true if two regex’s happen to be the same identical object in memory. Recommendation: Avoid using regex instances inside of Clojure data structures where you want to compare them to each other using =, and get true as the result even if the regex instances are not identical objects. If you feel the need to, consider converting them to strings first, e.g. (str #"abc") → "abc" (see CLJ-1182)
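To make the guide's point concrete, here is a minimal REPL sketch. The names re-a, re-b, and seen-patterns are made up for illustration; everything else is plain clojure.core.

```clojure
;; Two separately written literals compile to two distinct Pattern objects.
(def re-a #"abc")
(def re-b #"abc")

(= re-a re-b)                         ; => false, different objects
(= re-a re-a)                         ; => true, same object
(identical? re-a re-a)                ; => true

;; Comparing the source strings instead, as the guide suggests.
(= (str re-a) (str re-b))             ; => true

;; Hypothetical use: keep the string form, not the Pattern, in a set.
(def seen-patterns #{(str re-a)})
(contains? seen-patterns (str re-b))  ; => true
```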
OK, but why are java.util.regex.Patterns only equal if they’re the same object in memory? Why is regex equality undecidable, as Stu and Rich point out in the linked ticket?
Well, here’s the first line of the Pattern javadoc: “A compiled representation of a regular expression.”
I’d forgotten my CS education: we humans work with “regexes” that look like Strings, but a regular expression itself is a set of instructions for a regex engine. Those #"" literals are patterns: the underlying instructions compiled into something the re-* family of clojure.core functions can use.
Some proglangs turn the instructions into finite state machines. Java takes a different route and compiles them into a backtracking matcher, which behaves more like a small program than a plain automaton. So there’s our answer to decidability: asking if two Patterns (chunks of compiled matching logic) are equal is much like asking if two functions are equal. And while there’s a trivial base case:
(= (fn [x] (inc x))
(fn [x] (inc x)))
The general case, which we’d need to respect if we want behavior to match intent, is harder. Consider:
(= (fn [x] (inc x))
(fn [x] (+ 1 x)))
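The same gap between text and behavior shows up with regexes themselves. A small sketch, using two patterns picked purely for illustration (re-alt and re-class are made-up names): they agree on these sample inputs, yet their source text differs.

```clojure
(def re-alt   #"a|b")   ; alternation
(def re-class #"[ab]")  ; character class

;; Same results on these sample inputs...
(map #(re-matches re-alt %)   ["a" "b" "c"])  ; => ("a" "b" nil)
(map #(re-matches re-class %) ["a" "b" "c"])  ; => ("a" "b" nil)

;; ...but the source strings are not equal.
(= (str re-alt) (str re-class))               ; => false
```

So a text comparison answers "were these written the same way?", not "do these match the same inputs?".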
Anyway, if comparison of the pre-compiled thing is what you want, then j.u.regex.Pattern naturally provides a method to access the String-like expression itself:
(= (.pattern #".") (.pattern #"."))
=> true
…which is equivalent to .toString / Clojure’s str: the recommended workaround!
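One caveat worth sketching: a Pattern built through Java interop can carry flags that are not part of its source string, so two patterns can share the same .pattern text and still behave differently. The helper same-source? below is hypothetical, not part of Clojure or Java; it just compares both the source and the flags.

```clojure
(import 'java.util.regex.Pattern)

;; Same source text, different flags.
(def plain (Pattern/compile "abc"))
(def icase (Pattern/compile "abc" Pattern/CASE_INSENSITIVE))

(= (.pattern plain) (.pattern icase))  ; => true
(re-matches plain "ABC")               ; => nil
(re-matches icase "ABC")               ; => "ABC"

(defn same-source?
  "Hypothetical helper: true when two Patterns share source text and flags."
  [^Pattern a ^Pattern b]
  (and (= (.pattern a) (.pattern b))
       (= (.flags a) (.flags b))))

(same-source? plain icase)             ; => false
(same-source? #"abc" #"abc")           ; => true
```

With #"" literals this rarely matters, since flags are written inline (e.g. #"(?i)abc") and therefore show up in the string.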
In the end, my confusion came down to
- forgetting what regexes are,
- Java differentiating between compiled regexes (“patterns”) and the regular expressions behind them, and
- Clojure (understandably) implementing #"" literals as the compiled version.
Thus, a situation where the surprising behavior is an emergent property of every involved party making the right choice.
(originally a January 5, 2022 thread on twitter)