
I'm pulling articles from specific URLs to split them into sentences, but the extracted body text randomly loses the whitespace between some sentences, resulting in:

Jane went to the store.She bought a dog. The dog was very friendly.It had no teeth.

Some of my text contains stock symbols such as (AZ.GAN), so I can't simply insert a space after every period that has no adjacent whitespace.

Jane bought several shares of (TY.JPN). She lost all her cash money."Arg!" She cried.

Naively inserting a space after every period in the above example would destroy the stock symbol.

I'm curious whether anyone knows the cause of this. I have tried several HTML and DOM parsers. I use Simple_DOM to grab the plaintext, although I get the same result if I do it manually or with any other parsing engine.

user723220
  • This doesn't sound right... do you have a sample URL for the source data? Or a short snippet (with markup) where this is happening? I'm assuming it doesn't display that way in a browser (if it does, it's GIGO) – Phil Lello Apr 28 '11 at 23:14
  • If it is a case of line breaks, PHP's [nl2br()](http://php.net/manual/en/function.nl2br.php) should help. Take a look at [htmlspecialchars()](http://www.php.net/manual/en/function.htmlspecialchars.php) and [htmlentities()](http://www.php.net/manual/en/function.htmlentities.php) – msmafra Apr 28 '11 at 23:28
  • I can't access the markup; it's preprocessed through the Simple DOM object functions, which I've tried to manipulate without success. The plain text is obtained via file_get_html('url')->plaintext. I'm not sure what it loses in the process that causes the erratic spacing. It could be a wordwrap artifact. – user723220 Apr 29 '11 at 00:24
  • @Alix: [Garbage In, Garbage Out](http://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out) :) – sarnold Apr 29 '11 at 00:32
  • @sarnold: Oh, cool. Didn't know there was an acronym for that! =) – Alix Axel Apr 29 '11 at 01:11

2 Answers


Unfortunately I don't have an approach for your specific question, but is it possible that the missing space between sentences is actually a linebreak (e.g. \n) that your text viewer (whatever it is) isn't showing you?

Perhaps try something like this just to make sure

var articleContent = ... // get content
articleContent = articleContent.replace(/\n/g, ' NEW LINE ');

tomfumb
  • Using Simple_DOM to pull plain text from urls. PHP. I can do it manually by stripping tags/CSS/HTML, but get the same result. Spacing random. I'll try the newline trick and get back - thanks – user723220 Apr 28 '11 at 23:30
  • No, stripping the \n doesn't work - still with random spacing. – user723220 Apr 28 '11 at 23:50

Try doing:

$str = trim(preg_replace('~([(].+?[.])\s(.+?[)])~', '$1$2', str_replace('.', '. ', $str)));
Alix Axel
  • Hmm, that script is promising, but it obliterated my text and the behavior of my delimiters. The stock symbols vanished. Quotes were replaced by parentheses. Unfortunately, I have no way to access the HTML markup, because it is processed through the Simple DOM Object. – user723220 Apr 29 '11 at 00:18
  • @user723220: We can work on it... =) Can you see any differences in the spacing? Also, if you could post the output of the following snippet it would help immensely: `echo implode('|', array_map('ord', str_split($yourString)));`. – Alix Axel Apr 29 '11 at 00:25
  • I get a stream of thousands of numbers from that `10|32|115|97|105|100|44|32|97|110|100|32|97|116|32|108|101|97|115|116|32|49|55|32|112|101|111|112|108|101|32|119|101|114|101|32|107|10` – user723220 Apr 29 '11 at 01:27
  • I looked at the source HTML. After most sentences this code `

    ` appears. It must be the culprit: it is probably being stripped and leaving "" instead of " ". I contacted the author of Simple_DOM to see if he had any thoughts on where this parsing happens in simple_dom_php. The problem is that it's all done cleverly through objects, so you can't tell what is what. – user723220 Apr 29 '11 at 01:37
  • The above preg_replace seems to have no effect. It may be separating some of the stuck-together sentences, but in other articles it does nothing at all. – user723220 Apr 29 '11 at 02:06
  • @user723220: Yup, your problem has nothing to do with control chars. I assumed it did because you mentioned you were using Simple_DOM to get plain text from URLs, but apparently not. Somehow, Simple_DOM is messing up the HTML markup, which makes it unreliable to trust `strip_tags()` to preserve the content and work around it. As for the spacing, it must be coming from UA or custom CSS styles. My suggestion is that you quit using Simple_DOM and code what you need yourself, or you can `$str = str_replace('

    ', ' ', $str)` but that's kinda hacky... – Alix Axel Apr 29 '11 at 02:17
  • I'm going to write a script which turns all "." into ". " unless they are within parentheses. – user723220 Apr 29 '11 at 02:21
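
The idea in the last comment (add a space after a period unless it sits inside parentheses) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `add_missing_sentence_spaces` is hypothetical, stock symbols are assumed to always appear inside non-nested parentheses, and a period is treated as a sentence end only when an uppercase letter or a quote follows it directly. Abbreviations such as "U.S." would still be split, so the rule would need tuning for real articles.

```php
<?php
// Hypothetical helper: the name and the rules are illustrative only and are
// not part of Simple_DOM or any other library mentioned in this thread.
function add_missing_sentence_spaces(string $text): string
{
    // Split so that parenthesized groups such as "(TY.JPN)" become their own
    // chunks. With PREG_SPLIT_DELIM_CAPTURE they land at the odd indexes.
    $chunks = preg_split('/(\([^()]*\))/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 1) {
            continue; // inside parentheses: leave the stock symbol untouched
        }
        // Add a space after a period that is immediately followed by an
        // uppercase letter or a quote character (assumed sentence start).
        $chunks[$i] = preg_replace('/\.(?=[A-Z"\'])/', '. ', $chunk);
    }

    return implode('', $chunks);
}

$sample = 'Jane bought several shares of (TY.JPN). She lost all her cash money."Arg!" She cried.';
echo add_missing_sentence_spaces($sample), PHP_EOL;
```

With the sample sentence from the question, this prints `Jane bought several shares of (TY.JPN). She lost all her cash money. "Arg!" She cried.`, leaving the stock symbol intact. It works around the symptom only; preserving whitespace where the markup is stripped would be the cleaner fix.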