In my HTTP logs I see "GET /category/f%C2%ADile-to-download/ HTTP/1.1" 301 instead of "GET /category/file-to-download/ HTTP/1.1" 200. I discovered that %C2%AD is the percent-encoded UTF-8 form of a soft hyphen (U+00AD), an invisible character.

I need to check whether a request to Apache contains a soft hyphen and, if it does, remove it. Any suggestions on the best way to detect the soft hyphen and strip it? I ran some tests with RewriteRule, but got stuck.

Thanks!

user2285323
  • Do you know the referrer of the link? I would guess that the link is in a document (a PDF or something?) and is split over two lines, which may result in the soft hyphen. – icabod Jun 10 '13 at 16:02
  • For example, comments on YouTube are modified with soft hyphens. – user2285323 Jun 11 '13 at 07:49

2 Answers

As I understand it, mod_rewrite matches against the un-escaped (percent-decoded) URL path, so to match the soft hyphen and remove it you would need to save your .htaccess file in UTF-8 encoding (most modern editors will do this).
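
As a quick sanity check of what "un-escaped" means here, the following Python sketch (an illustration only, not Apache configuration) decodes the path from the log line in the question. The variable names are hypothetical; it only shows that %C2%AD unescapes to the two bytes C2 AD, which are the UTF-8 encoding of U+00AD.

```python
from urllib.parse import unquote, unquote_to_bytes

# Request path as it appears in the access log (percent-encoded).
logged_path = "/category/f%C2%ADile-to-download/"

# Raw bytes after unescaping: the soft hyphen is the two-byte sequence C2 AD.
raw = unquote_to_bytes(logged_path)
has_shy_bytes = b"\xc2\xad" in raw

# Decoded as UTF-8 text, the same sequence is the single code point U+00AD.
text = unquote(logged_path)
cleaned = text.replace("\u00ad", "")

print(has_shy_bytes)
print(cleaned)
```

This is why a rule has to match either a literal soft hyphen saved as UTF-8 or the byte sequence C2 AD, rather than the text %C2%AD.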

You will then need to enter a literal soft hyphen into your rule. The following should remove a single soft hyphen from the request path, but as mentioned it relies on the file being saved as UTF-8:

RewriteRule ([^-]*)-([^-]*) $1$2

Note that the - above is only a placeholder: you would need to replace it with the actual soft hyphen character (U+00AD), saved as UTF-8.

Perhaps an easier option would be this:

RewriteRule ([^\xc2\xad]*)\xc2\xad([^\xc2\xad]*) $1$2 [N]

It uses the specific UTF-8 byte sequence you're seeing (C2 AD) to remove the soft hyphen from the string. The [N] flag should restart the rewriting process from the first rule, which will remove any remaining soft hyphens on later passes.
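
To illustrate the one-removal-per-pass idea behind [N], here is a minimal Python sketch (not Apache configuration). It assumes an anchored variant of the pattern, `^(.*)\xc2\xad(.*)$`, which is not the rule given above; the function name `strip_soft_hyphens` and the `max_passes` cap are hypothetical and only there to keep the loop bounded.

```python
import re

# Assumed anchored variant: everything before the last soft hyphen
# (bytes C2 AD), then everything after it.
SHY_RULE = re.compile(rb"^(.*)\xc2\xad(.*)$", re.DOTALL)


def strip_soft_hyphens(path: bytes, max_passes: int = 32) -> bytes:
    """Remove one soft hyphen per pass, restarting until none remain."""
    for _ in range(max_passes):
        match = SHY_RULE.match(path)
        if match is None:
            return path  # nothing left to remove
        # Like the substitution $1$2, the whole path is replaced.
        path = match.group(1) + match.group(2)
    raise RuntimeError("too many passes; giving up")


result = strip_soft_hyphens(b"category/f\xc2\xadile-to-d\xc2\xadownload/")
print(result)
```

Because the two groups together cover the entire path, nothing outside the soft hyphen is lost on any pass. Whether Apache behaves the same way with a given combination of flags would still need to be tested on the server.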

icabod
  • I currently have this rule working in my case: `RewriteRule ([^\xc2\xad]*)[\xc2\xad]+([^\xc2\xad]*) /$1$2 [N,R=301,L]` – user2285323 Jun 11 '13 at 07:33

Thanks @icabod

I currently have this rule working in my case:

RewriteCond %{REQUEST_URI} \xc2\xad [NC]
RewriteRule ([^\xc2\xad]*)[\xc2\xad]+([^\xc2\xad]*) /$1$2 [N,R=301,L,NC]

The .htaccess file should be saved in UTF-8 format, as mentioned above. R=301 redirects with HTTP status code 301, and NC makes the match case-insensitive. However, the rule doesn't work when the URL contains two soft hyphens in different places, like this:

/category/f%C2%ADile-to-d%C2%ADownload/
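
One thing worth checking separately from the flags is what the pattern itself captures. The following Python sketch (an approximation, not Apache itself) applies the same regular expression to the unescaped path bytes. It assumes that the leading slash is stripped in .htaccess context and that the substitution replaces the whole path rather than only the matched part.

```python
import re

# The pattern from the rule above, applied to the unescaped path bytes.
RULE = re.compile(rb"([^\xc2\xad]*)[\xc2\xad]+([^\xc2\xad]*)")

path = b"category/f\xc2\xadile-to-d\xc2\xadownload/"
match = RULE.search(path)

# Assumption: the substitution /$1$2 replaces the entire path,
# so anything outside the two captured groups is dropped.
rewritten = b"/" + match.group(1) + match.group(2)
print(rewritten)
```

Under those assumptions the second group stops at the second soft hyphen, so the rewritten path would be cut short there. If that matches what the redirect target looks like in the logs, anchoring the pattern so that the groups cover the whole path may be worth trying.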

user2285323
  • It probably doesn't work for multiple soft hyphens because you have the `[L]` flag specified, which stops any further rules from running. I guess this takes precedence over `[N]`, which re-runs the rules. I would try removing the `[L]` flag. – icabod Jun 11 '13 at 08:09