Fixing a bug in simple-rss

Just walking through some bug fixing.

Original:

if content =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/un then CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/u,'').strip else content.gsub(/(<!\[CDATA\[|\]\]>)/u,'').strip end

Under Ruby 1.9.2 the tests give this warning:

simple-rss/lib/simple-rss.rb:155: warning: regexp match /.../n against to UTF-8 string

This makes sense: the /un on the first regex behaves the same as /n, so a binary
regex is being matched against a UTF-8 string. So it was altered to:

if content =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/u then CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/u,'').strip else content.gsub(/(<!\[CDATA\[|\]\]>)/u,'').strip end

This silences the warning and looks nice and consistent, with all the regexps being UTF-8.
However, I ran into a problem in production where content was apparently ASCII-8BIT,
and I got the following error:

incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)
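
For anyone who wants to reproduce this outside of simple-rss, here is a minimal sketch. The sample strings and the cut-down /%/u regex are made up for illustration, not taken from the library. It also shows one plausible reason the tests never caught it: a binary string that only contains ASCII bytes still matches a /u regex fine, and the error only shows up once there are high bytes in it.

```ruby
# Hypothetical sample data: the UTF-8 bytes for "café%20", but tagged as binary,
# which is roughly what content fetched off the wire can look like.
binary = "caf\xC3\xA9%20".dup.force_encoding("ASCII-8BIT")

# A cut-down stand-in for the first regex; the /u flag pins it to UTF-8.
utf8_regexp = /%/u

# Matching a UTF-8-pinned regexp against a binary string with high bytes raises.
error_class = begin
  binary =~ utf8_regexp
  nil
rescue StandardError => e
  e.class
end

# A binary string holding only 7-bit ASCII is still accepted by the same regexp.
ascii_only = "a%20".dup.force_encoding("ASCII-8BIT")
ascii_only_hit = ascii_only =~ utf8_regexp

puts error_class.inspect     # Encoding::CompatibilityError
puts ascii_only_hit.inspect  # 1
```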

So then I thought: why is the /u needed in the first place? So I changed the code to:

if content =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/ then CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip else content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip end

This seems to pass the tests in 1.8.7 and 1.9.2, so I decided to go with it.
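
My best guess at why dropping the flag works on 1.9: a regex literal with only ASCII characters and no encoding flag is not pinned to any one encoding, so it can be matched against either kind of string. A small sketch to check that, again with a made-up sample string and a cut-down /%/ regex rather than the real one:

```ruby
# Hypothetical sample data: the same bytes, once tagged UTF-8 and once tagged binary.
utf8   = "caf\xC3\xA9%20".dup.force_encoding("UTF-8")
binary = utf8.dup.force_encoding("ASCII-8BIT")

flagless = /%/    # ASCII-only source, no encoding flag
pinned   = /%/u   # pinned to UTF-8 by the flag

# fixed_encoding? reports whether a regexp is tied to a single encoding.
flagless_fixed = flagless.fixed_encoding?
pinned_fixed   = pinned.fixed_encoding?

# The flagless regexp matches both strings without warnings or errors.
utf8_hit   = utf8 =~ flagless     # character index in the UTF-8 string
binary_hit = binary =~ flagless   # byte index in the binary string

puts flagless_fixed  # false
puts pinned_fixed    # true
puts utf8_hit        # 4
puts binary_hit      # 5
```

The two indexes differ only because the é counts as one character in the UTF-8 string and as two bytes in the binary one.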

