

#RUBY CODEPOINTS CODE#
When attempting to strip a string, there are three basic options when an invalid code point is encountered:

1. Raise an exception.
2. Treat the broken code point as a non-whitespace boundary and stop processing.
3. Treat the broken code point as if it were whitespace and remove it.

For background, Ruby does not consider the string's code range for lstrip or rstrip. It permits stripping strings with an ENC_CODERANGE_BROKEN code range so long as no invalid code points are encountered while performing the loop to remove whitespace. What it does when such a code point is encountered, however, is not consistent between lstrip and rstrip.

String#lstrip will unconditionally raise an invalid byte sequence error when it reaches an invalid code point:

> ruby -v
e:1:in `lstrip': invalid byte sequence in UTF-8 (ArgumentError)

"a\x80bc" # This one is okay because the broken code point appears after a non-whitespace code point.

Things get a lot messier with String#rstrip, however. Depending on context, rstrip may raise an exception, treat the broken code point as a non-whitespace boundary and stop processing, or treat the broken code point as if it were whitespace and remove it.

String#rstrip will ignore the invalid code point if it immediately follows a non-whitespace code point:

> ruby -v

String#rstrip will remove the invalid code point if it is surrounded by whitespace:

> ruby -v

String#rstrip will raise an exception if no valid, non-whitespace code points appear before it:

> ruby -v
e:1:in `rstrip': invalid byte sequence in UTF-8 (ArgumentError)

It looks to me like the current behavior is a byproduct of the functions chosen for finding code point boundaries, rather than something deliberately chosen. E.g., rb_str_lstrip will call rb_enc_codepoint_len, which raises on invalid code points, while rb_str_rstrip calls rb_enc_prev_char, which doesn't perform the same code point validation. I think it'd make for a better user experience if lstrip and rstrip behaved consistently with each other, which would also unify the behavior within rstrip. What that behavior should be needs to be decided and I'm hoping to reach consensus on the semantics in this issue. My own take on the three options, with no significance to the order, is:

Raising an exception might violate user expectations. The documentation for lstrip is "Returns a copy of the receiver with leading whitespace removed." It seems fairly straightforward and there's no mention of string validation.

Treating broken code points the same as any other non-whitespace code point would be logically consistent. An additional benefit is that the method could be implemented more efficiently, as the whitespace check can be done without calculating code point boundaries. Only ASCII whitespace code points are stripped and those, by definition, are only one byte wide. However, if lstrip and rstrip ever evolve to handle non-ASCII whitespace we'll be back to calculating code point boundaries.

Despite rstrip doing it in some cases, I don't think removing the invalid code points is what an end user would expect, and it runs counter to the method's documentation.
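To make the "treat broken code points as non-whitespace" option concrete, here is a minimal Ruby-level sketch of a byte-wise strip. It is not the C implementation and not a proposed patch. The helper names (bytewise_lstrip, bytewise_rstrip) and the exact set of strippable bytes are my own assumptions for illustration, and the approach only makes sense for ASCII-compatible encodings such as UTF-8, where a byte below 0x80 is never part of a multi-byte code point.

```ruby
# Byte values treated as strippable. This set is an assumption (ASCII
# whitespace plus NUL); adjust it to whatever definition is agreed on.
ASCII_WHITESPACE_BYTES = [0x00, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20].freeze

# Hypothetical helper: drop leading whitespace bytes. Any other byte,
# including one from a broken code point, stops the scan.
def bytewise_lstrip(str)
  start = 0
  while start < str.bytesize && ASCII_WHITESPACE_BYTES.include?(str.getbyte(start))
    start += 1
  end
  str.byteslice(start, str.bytesize - start)
end

# Hypothetical helper: drop trailing whitespace bytes, same rule as above.
def bytewise_rstrip(str)
  stop = str.bytesize
  while stop > 0 && ASCII_WHITESPACE_BYTES.include?(str.getbyte(stop - 1))
    stop -= 1
  end
  str.byteslice(0, stop)
end

# The broken byte \x80 is left in place in both directions.
p bytewise_lstrip(" \t\x80abc").bytes
p bytewise_rstrip("abc\x80 \n").bytes
```

Because the scan never decodes a code point, it can neither raise on an invalid byte sequence nor silently remove one, which is the consistency argument made above.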
