I want to replace UTF-8 HTML entities in HTML sources with the real characters. I have an "entities" replacement table, which is traversed with the code below. When I run this code, it drives my CPU to 100%.

Could you please help me rewrite the first loop in a better way? I understand that strings in Lua are immutable, so I think many copies of the data variable are created, and this could be the reason.

local entities = {
    {["char"]="!", ["utf"]="&#33;"},
    {["char"]='"', ["utf"]="&#34;"},
    {["char"]="#", ["utf"]="&#35;"},
    {["char"]="$", ["utf"]="&#36;"},
    {["char"]="%", ["utf"]="&#37;"},
    {["char"]="&", ["utf"]="&#38;"},
    {["char"]="'", ["utf"]="&#39;"},
    -- +312 rows more
}    

local function clear_text(data)
    for _, e in ipairs(entities) do
        data = string.gsub(data, e.utf, e.char)
    end
    return data
end

-- this is just for testing ... replacement in many html sources
for i=1,200 do
    local data = some_html_page_source()
    clear_text(data)
end
ivan73

3 Answers

EDIT: misread question, so rewrote it with the same principle.

According to this answer, you can use str:gsub(pattern, function) to perform a custom replacement on all matches of pattern inside str.

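A pattern such as &#%d+; matches each decimal numeric entity and calls the function once per match. (The greedy &#.+; would instead match everything from the first &# to the last ; in the string, so it is not suitable here.)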

All that is left to do in the callback function is to find the matching human-readable char and return it as the replacement value. To this end, it would be faster if entities were keyed by the utf strings, with their respective char as value, so that you don't have to iterate over entities on every match.

Another edit: according to the Lua documentation on gsub, the replacement argument (the third argument of string.gsub, or the second when called as str:gsub) can be a table. In that case the lookup is done automatically: each match is used as a key and replaced with the value stored under it, and a match with no entry in the table is left unchanged. That would be the cleanest solution once you restructure entities.
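As a minimal sketch of the table-based approach described above: the short entities list below is only a stand-in for the question's 300+ rows, and entity_to_char is a hypothetical name for the restructured lookup table. It assumes all entities are decimal numeric references of the form &#NN;.

```lua
-- Small stand-in for the question's list (the real one has 300+ rows).
local entities = {
    {char = "!", utf = "&#33;"},
    {char = "%", utf = "&#37;"},
    {char = "&", utf = "&#38;"},
}

-- Build the lookup once: entity text -> replacement character.
-- (entity_to_char is a hypothetical name, not from the question.)
local entity_to_char = {}
for _, e in ipairs(entities) do
    entity_to_char[e.utf] = e.char
end

local function clear_text(data)
    -- Single pass over data. Each whole match is used as a table key;
    -- a match with no entry in the table is left in place.
    -- The outer parentheses drop gsub's second result (the match count).
    return (data:gsub("&#%d+;", entity_to_char))
end

print(clear_text("100&#37; Tom &#38; Jerry&#33; &#9999;"))
--> 100% Tom & Jerry! &#9999;
```

This scans each page once instead of once per table row, which is where most of the savings should come from. A table value is also inserted literally, so "%" needs no escaping, unlike in a replacement string.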

Ward D.S.

-- Lua 5.3 required
local html_entities = {
   nbsp = " ",
   lt = "<",
   gt = ">",
   amp = "&",
   euro = "€",
   copy = "©",
   Gamma = "Γ",
   Delta = "Δ",
   prod = "∏",
   sum = "∑",
   forall = "∀",
   exist = "∃",
   empty = "∅",
   nabla = "∇",
   isin = "∈",
   notin = "∉",
   -- + many more rows
}


local str = [[&exist; &euro; &empty; &Delta; &#8364; &#x20AC;]]

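-- "(#?)" captures the numeric prefix, the lazy "(.-)" captures the name or code
str = str:gsub("&(#?)(.-);",
   function(prefix, name)
      if prefix ~= "" then
         -- "0"..name turns "x20AC" into "0x20AC" (hex) and "8364" into "08364" (decimal)
         local code = tonumber("0"..name)
         -- returning nil keeps the original text of a malformed reference
         return code and utf8.char(code)
      else
         -- unknown names yield nil, so the match is left unchanged
         return html_entities[name]
      end
   end
)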

print(str)
Egor Skriptunoff

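There's another way of replacing a set of characters: match them all with one character class and compute the replacement in a callback. Note that the example below goes from characters to entities, which is the opposite direction of the question.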

local function clear_text(data)
    return (string.gsub(
        data,
        [=[[!"#$%%&']]=],  -- all your characters go here, inside the [...] class; a literal % must be written as %%
        function(c)
            return "&#" .. string.byte(c) .. ";"  -- replace with char code
        end
    ))
end

-- this is just for testing ... replacement in many html sources
for i=1,200 do
    local data = "!#!#!#!#!#!";
    print(clear_text(data))
end
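
For the direction the question asks about (entities to characters), the same computed-callback idea can be turned around. The following is only a sketch: decode_numeric is a hypothetical helper name, and it assumes that only decimal references to ASCII code points need decoding, so it relies on string.char alone.

```lua
-- decode_numeric is a hypothetical helper, not part of any answer above.
local function decode_numeric(data)
    return (data:gsub("&#(%d+);", function(digits)
        local code = tonumber(digits)
        -- Assumption: only ASCII code points (below 128) are decoded here.
        if code and code < 128 then
            return string.char(code)
        end
        -- Falling through returns nothing, so gsub keeps the original match.
    end))
end

print(decode_numeric("&#33;&#35;&#33; &#8364;"))
--> !#! &#8364;
```

Code points outside ASCII are deliberately left untouched here; on Lua 5.3 or later, utf8.char could be used for those instead.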