
OK, here is the thing: I want a PHP script to open and read a user-uploaded Word document, take the email addresses that appear in the document, and store them in a database.

Only the email addresses! They will be mixed in with other text, like

Email : someone@example.com or like "Email is someone@example.com"

Any format. One thing is for sure: there will be a space separating the email address from the other words. Can someone help me? :D

SeanC

2 Answers


This is a bit broad really. Fundamentally, you need to handle these steps:

Upload the word document

You'll need to let users upload a file. There's a tutorial at w3schools which should get you started.
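As a rough sketch of the server side of that step, something like the following could validate the upload before keeping it. The form field name `document`, the helper `checkWordUpload`, and the 5 MB limit are all assumptions for illustration, not anything from the question.

```php
<?php
// Hypothetical helper: inspects one entry of $_FILES and returns an error
// message, or null when the upload looks acceptable.
function checkWordUpload(array $file, int $maxBytes = 5242880): ?string
{
    $code = $file['error'] ?? UPLOAD_ERR_NO_FILE;
    if ($code !== UPLOAD_ERR_OK) {
        return 'Upload failed with error code ' . $code;
    }
    if ($file['size'] > $maxBytes) {
        return 'File is too large';
    }
    // The extension is only a first filter; the parser must still cope with bad input.
    $ext = strtolower(pathinfo($file['name'], PATHINFO_EXTENSION));
    if (!in_array($ext, ['doc', 'docx'], true)) {
        return 'Only .doc and .docx files are accepted';
    }
    return null;
}

// Assumed form field: <input type="file" name="document">
if (isset($_FILES['document'])) {
    $error = checkWordUpload($_FILES['document']);
    if ($error !== null) {
        exit($error);
    }
    // Store under a generated name rather than the client-supplied one.
    $ext = strtolower(pathinfo($_FILES['document']['name'], PATHINFO_EXTENSION));
    $target = sys_get_temp_dir() . '/' . uniqid('upload_', true) . '.' . $ext;
    if (!move_uploaded_file($_FILES['document']['tmp_name'], $target)) {
        exit('Could not store the uploaded file');
    }
}
```

The HTML form has to use `method="post"` and `enctype="multipart/form-data"`, otherwise `$_FILES` stays empty.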

Parse the contents

Office files are complex: each one is effectively a small file system, since it can embed images, other documents, and so on. The newer .docx files are actually just zip archives containing XML; try renaming one to .zip and opening it. The old-style .doc is a proprietary binary MS format and, while equally complex, is far more obfuscated. This library appears to convert Word files to HTML, which may make reading them a lot easier.
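For .docx only, a minimal sketch of the "unzip and read the XML" approach might look like this. It assumes PHP's zip extension (`ZipArchive`) is enabled, and `docxToText` is a made-up helper name. It only reads the main body (`word/document.xml`), so headers, footers, and footnotes are ignored, and it does nothing for old-style .doc files.

```php
<?php
// Hypothetical helper: returns the visible body text of a .docx file.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }
    // The main document body lives in this entry of the archive.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found; not a .docx file?');
    }
    // End every paragraph with a newline, and turn tabs and line breaks
    // into spaces, so neighbouring words do not run together.
    $xml = str_replace('</w:p>', "\n", $xml);
    $xml = preg_replace('/<w:(tab|br)\b[^>]*>/', ' ', $xml);
    // Drop the remaining markup and decode entities such as &amp;
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

Word often splits a single word across several runs (`<w:r>` elements); stripping the tags without inserting anything between runs joins those pieces back together, which matters for an email address that was partly reformatted.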

Find the email address

I suspect your best chance here is to use a regex to extract the email address from the body. What should happen if there are multiple email addresses? Here's an introduction to email regexes which may be of some help. This answer covers the same problem.
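One way to handle the multiple-address case is to collect every candidate with `preg_match_all`, then validate and de-duplicate. This is only a sketch: `extractEmails` is a hypothetical helper, and the pattern is deliberately simple (it will not match every address that the RFCs allow).

```php
<?php
// Hypothetical helper: returns the unique email addresses found in $text.
function extractEmails(string $text): array
{
    // Local part, "@", then a domain with at least one dot.
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/';
    preg_match_all($pattern, $text, $matches);

    $emails = [];
    foreach ($matches[0] as $candidate) {
        // Second opinion from PHP's built-in validator.
        if (filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false) {
            $emails[] = $candidate;
        }
    }
    // Remove duplicates and reindex from 0.
    return array_values(array_unique($emails));
}

$found = extractEmails('Email : someone@example.com or like "Email is other@example.org".');
print_r($found);
```

Because the domain part must end in a letter, digit, or hyphen, a full stop that merely ends the sentence is not swallowed into the address.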

For a more detailed answer, you're going to have to provide a more specific question.

Basic
  • Thanks. Any regex example? – creativecodes Jul 22 '12 at 14:20
  • The linked library requires Python? Can I just rename the file? Will that work? – creativecodes Jul 22 '12 at 15:01
  • @ShyleshMohan Sorry, my mistake. Try [these](http://stackoverflow.com/questions/188452/reading-writing-a-ms-word-file-in-php) [answers](http://stackoverflow.com/a/7371315/156755). Failing that, you might need to unzip/read yourself - [PhpWord](http://phpword.codeplex.com/) might be a good starting point – Basic Jul 22 '12 at 16:44

Convert Word to text:

// Requires a Windows server with Microsoft Word installed and PHP's COM extension enabled.
$filename = realpath("file.doc"); // Word needs an absolute path
$TXTfilename = $filename . ".txt";
$word = new COM("word.application") or die("Unable to instantiate Word object");
$word->Documents->Open($filename);
// the '2' parameter (wdFormatText) specifies saving in txt format
$word->Documents[1]->SaveAs($TXTfilename, 2);
$word->Documents[1]->Close(false);
$word->Quit();
// Dropping the reference releases the COM object; PHP 5+ has no Release() method
$word = null;
$content = file_get_contents($TXTfilename);
unlink($TXTfilename);

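Get all emails in an array:

$content = "My email is email@example.com"; // example input
$matches = array();
// Allow dots and plus signs in the local part, and multi-level domains such as example.co.uk
$pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/';
// preg_match_all collects every address; preg_match would stop at the first one
preg_match_all($pattern, $content, $matches);
$emails = $matches[0];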
rgtk
  • Using COM, you're assuming a Windows server with Office installed on it. In addition, Office automation is not good in a production environment: high overhead (a new instance of Office for every operation), Office still sometimes throws message boxes on errors (leaving processes waiting for a user to click OK before releasing resources), etc... It's not that it won't work, but it won't work reliably and robustly. – Basic Jul 22 '12 at 14:14
  • True story. Your answer is better. Thank you. – rgtk Jul 22 '12 at 14:29
  • Only because I've been through the pain of maintaining a system using COM to generate reports for Excel - The damned thing was so temperamental that the server was scheduled to reboot regularly! – Basic Jul 22 '12 at 14:45
  • Thanks for the answers, so using COM is a bad idea! – creativecodes Jul 22 '12 at 14:59
  • @ShyleshMohan Read [this](http://support.microsoft.com/kb/257757). "Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when Office is run in this environment." – Basic Jul 22 '12 at 22:29