
Each line is a string

4Â&nbsp;
minutes
12Â&nbsp;
minutes
16Â&nbsp;
minutes

I was able to remove the Â successfully using str_replace, but not the HTML entity. I found this question: How to remove html special chars?

But the preg_replace suggested there did not do the job. How can I remove the HTML entity and that Â?

Edit: I think I should have said this earlier: I am using DOMDocument::loadHTML() and DOMXPath.

Edit: Since this seems like an encoding issue, I should say that these are actually all separate strings.

Strawberry
  • The question is how you got them in the first place... – Artefacto Aug 30 '10 at 00:07
  • I was using DOM to load an HTML page, and after parsing it and displaying it, it was just there. I have no idea how it got there. Edit: Well, actually the `&nbsp;` was already in the original source, but not the Â. – Strawberry Aug 30 '10 at 00:09
  • What's the encoding of this HTML page and how are you loading it? – Artefacto Aug 30 '10 at 00:20
  • @Artefacto How do I check? I am just using `DOMDocument::loadHTML()` – Strawberry Aug 30 '10 at 00:25
  • That's UTF-8 encoding being rendered as ISO/ASCII; the page you got it from is UTF-8 – MikeAinOz Aug 30 '10 at 00:25
  • @MikeAinOz How do you know that? So what should I do from here then? – Strawberry Aug 30 '10 at 00:27
  • @MikeAinOz: That sort of looks like it, but usually you see `Â[something]`, not a lone `Â`. It could be that this is UTF-8 misinterpreted as ISO-8859-1 (latin1), and what would follow the `Â` is landing on one of latin1's control characters... – Thanatos Aug 30 '10 at 00:33
  • @Thanatos You are right, this is actually `Â[something]`. In my case, `Â[&nbsp;4]`. The entity was from the original source. – Strawberry Aug 30 '10 at 00:43
  • Is this HTML file public? Can you post some reproducing code? – Artefacto Aug 30 '10 at 02:12

2 Answers


Alright - I think I've got a handle on this now - I want to expand on some of the encoding errors that people are getting at:

This seems to be an advanced case of Mojibake, but here is what I think is going on. MikeAinOz's original suspicion that this is UTF-8 data is probably true. If we take the following UTF-8 data:

4&nbsp;minutes

Now, remove the HTML entity, and replace it with the character it actually corresponds to: U+00A0. (It's a non-breaking space, so I can't exactly "show" you.) You get the string "4[nbsp]minutes". Encode this as UTF-8, and you get the following byte sequence:

characters:  4  [nbsp]   m   i   n ...
bytes     : 34  C2  A0  6D  69  6E ...

(I'm using [nbsp] above to mean a literal non-breaking space: the character itself, not the HTML entity &nbsp; that represents it. It's just white-space, and thus difficult to show.) Note that the [nbsp]/U+00A0 (non-breaking space) takes 2 bytes to encode in UTF-8.
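To check the byte sequence above yourself, here is a minimal PHP sketch. It assumes PHP 7 or later for the `\u{...}` escape, and the variable names are only illustrative.

```php
<?php
// "\u{00A0}" always yields the UTF-8 encoding of U+00A0, i.e. the bytes C2 A0.
$text = "4\u{00A0}min";

// Dump the string byte by byte as space-separated upper-case hex.
$hex = strtoupper(implode(' ', str_split(bin2hex($text), 2)));

echo $hex, "\n"; // 34 C2 A0 6D 69 6E
```

The single non-breaking space accounts for both `C2` and `A0`; every other character here is plain ASCII and takes one byte.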

Now, to go from the byte stream back to readable text, we should decode using UTF-8, since that's what we encoded with. Let us instead decode with ISO-8859-1 ("latin1"); if you use the wrong encoding, this is almost always the one.

bytes     : 34  C2      A0  6D  69  6E ...
characters:  4   Â  [nbsp]   m   i   n ...

Now switch the raw non-breaking space into its HTML entity representation, and you get what you have: 4Â&nbsp;minutes.
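The mis-decoding step can be reproduced in a few lines. This is only a sketch of the suspected failure, not the asker's actual code, and it assumes the iconv extension is available.

```php
<?php
// The UTF-8 bytes for "4", U+00A0 (non-breaking space), "minutes".
$utf8Bytes = "4\xC2\xA0minutes";

// Deliberately wrong: read each byte as one ISO-8859-1 character, then
// convert to UTF-8 so the result can be displayed. The byte C2 becomes
// "Â" (U+00C2) and the byte A0 becomes a non-breaking space (U+00A0).
$misread = iconv('ISO-8859-1', 'UTF-8', $utf8Bytes);

// Write the non-breaking space as its HTML entity, as in the question.
$shown = str_replace("\u{00A0}", '&nbsp;', $misread);

echo $shown, "\n"; // 4Â&nbsp;minutes
```

Note that the "Â" and the non-breaking space come from the same original character, which is why removing only one of them never quite cleans up the string.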

So, either your PHP stuff is interpreting your text in the wrong character set, and you need to tell it otherwise, or you are outputting the result somehow in the wrong character set. More code would be useful here -- where are you getting the data you're passing to this loadHTML, and how are you going about getting the output you're seeing?
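One common way to "tell it otherwise" is to hand loadHTML() an explicit charset declaration. The sketch below assumes the input really is UTF-8 and that the DOM extension is installed; `loadUtf8Html` is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: parse UTF-8 HTML by prepending a charset declaration,
// so that libxml does not have to guess the encoding of the input.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

$doc = loadUtf8Html("<p>4\u{00A0}minutes</p>");
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// With the encoding read correctly there should be no stray "Â"; the
// remaining non-breaking space can be swapped for an ordinary space.
$clean = str_replace("\u{00A0}", ' ', $text);
echo $clean, "\n";
```

If the page is actually in some other encoding, the declared charset has to match it; declaring UTF-8 for non-UTF-8 input just moves the problem.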


Some background: A "character encoding" is just a means of going from a series of characters to a series of bytes. What bytes represent "é"? UTF-8 says C3 A9, whereas ISO-8859-1 says E9. To get the original text back from a series of bytes, we must know what we encoded it with. If we decode C3 A9 as UTF-8 data, we get "é" back; if we (mistakenly) decode it as ISO-8859-1, we get "Ã©". Junk. In pseudo-code:

utf8-decode ( utf8-encode ( text-data ) )           // OK
iso8859_1-decode ( iso8859_1-encode ( text-data ) ) // OK
iso8859_1-decode ( utf8-encode ( text-data ) )      // Fails
utf8-decode ( iso8859_1-encode ( text-data ) )      // Fails

This isn't PHP code, and it isn't your fix... it's just the crux of the problem. Somewhere in the larger system, a mismatch like this is happening, and things are getting confused.
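For the curious, the four cases can be tried out in PHP roughly like this. It is a sketch that assumes the iconv extension is available, and it is still not the fix, only a demonstration of the "é" example.

```php
<?php
$utf8   = "\u{00E9}";                          // "é" encoded as UTF-8: bytes C3 A9
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8); // "é" encoded as ISO-8859-1: byte E9

echo strtoupper(bin2hex($utf8)), "\n";   // C3A9
echo strtoupper(bin2hex($latin1)), "\n"; // E9

// Matching decode: ISO-8859-1 bytes read as ISO-8859-1 round-trip cleanly.
$roundTrip = iconv('ISO-8859-1', 'UTF-8', $latin1);

// Mismatch 1: UTF-8 bytes read as ISO-8859-1 turn into two characters.
$mojibake = iconv('ISO-8859-1', 'UTF-8', $utf8);
echo $mojibake, "\n"; // Ã©

// Mismatch 2: the lone byte E9 is not valid UTF-8 at all. preg_match()
// with the /u modifier returns false for an invalid UTF-8 subject.
$latin1IsValidUtf8 = preg_match('//u', $latin1) === 1;
var_dump($latin1IsValidUtf8); // bool(false)
```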

bishop
Thanatos

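This looks like an encoding error: your document is encoded as UTF-8, but is being interpreted as a single-byte encoding such as ISO-8859-1. Solving the encoding mismatch will solve your issue. You could try using utf8_decode() on your source before passing it to DOMDocument::loadHTML(). (utf8_decode() is deprecated as of PHP 8.2; mb_convert_encoding($html, 'ISO-8859-1', 'UTF-8') is the documented replacement.)

Here's an alternative solution from the DOMDocument::loadHTML() documentation page.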

Just Jake