
I am starting a recruitment consultancy, and sooner or later we will be dealing with many applicant résumés or CVs (curricula vitae). I am building a simple application with PHP and MySQL (the target server will run Windows) to let applicants upload their CVs on our website. For now I will restrict uploads to MS Word documents with a maximum size of 500 KB.
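A minimal sketch of that upload restriction, written in Python for brevity (the same checks carry over to PHP). It assumes "MS Word docs" means `.doc` and `.docx`; the names `validate_upload`, `MAX_BYTES` and `ALLOWED_SUFFIXES` are hypothetical. Besides the extension and size, it checks the leading bytes, since a file extension alone is easy to fake.

```python
from pathlib import Path

MAX_BYTES = 500 * 1024                 # the 500 KB limit from the question
ALLOWED_SUFFIXES = {".doc", ".docx"}   # assumption: "MS Word docs" means these two

# Leading bytes of the two container formats Word uses.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                       # .docx (ZIP package)


def validate_upload(filename, data):
    """Return a list of problems; an empty list means the upload is acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append("unsupported extension")
    if len(data) > MAX_BYTES:
        problems.append("file larger than 500 KB")
    # Only compare magic bytes when the extension tells us which format to expect.
    expected = ZIP_MAGIC if suffix == ".docx" else OLE2_MAGIC
    if suffix in ALLOWED_SUFFIXES and not data.startswith(expected):
        problems.append("content does not look like a Word document")
    return problems
```

A matching signature only shows that the container format is plausible (any ZIP archive starts with the same bytes as a `.docx`), so this is a first filter rather than proof that the file is a real Word document.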

My question concerns two operations that will be performed on these files:

  1. Search the content of these files for specific keywords to find résumés with matching skills.

  2. Serve these files to employers, either through a download link or by emailing the résumés to them.

Coming straight to the questions:

  1. Do I store the actual files on the file system and run Windows search on them?

  2. Or do I insert only the content into a MySQL BLOB/CLOB column, search the table, and then serve the content to the employer from the table itself?

  3. Or do I store the file on the file system and also insert the content into a MySQL BLOB, then search the content in MySQL and serve the file from the file system?

My feeling is that once the number of résumés reaches the thousands, Windows search will be extremely slow. On the other hand, I searched the internet and found that storing large amounts of file content in a database is not advisable.

So I would like your suggestions on which approach to adopt, assuming that at some point we will be storing and retrieving thousands of résumés.

Thanks in advance for your help.

nigs
  • If they are uploaded as plain text, you can search each one at upload time for all the keywords you may be looking for and store those in the DB along with the path to the actual file. When processing, you search the DB for the keywords and pull the files that match. If they are not plain-text résumés/CVs, you'll first want to find a library that reads the file types you are willing to accept, and then do the same. – Jon Jan 13 '13 at 07:08
  • Thanks Jon. Unless we ask users to copy and paste the text into a text field on our site, I don't think we can control the formatting of the résumés uploaded to it. I have still kept pasting the résumé into a textarea in mind as an option, in case I need a temporary solution for the initial launch of the website. By the way, regarding your comment above, is there any particular library for reading Word documents in PHP? – nigs Jan 15 '13 at 08:02
  • I can think of two: http://static.holloway.co.nz/docvert/ and http://www.blogs.zeenor.com/it/read-ms-word-docx-ms-word-2007-file-document-using-php.html The first one is the one that I would recommend, though it converts to HTML, the second is more a quick-and-dirty solution to read them and has a brief 'how-to'. I recommend the first because it would give you more options of files you can read - the second is just for `.docx` files and only has one associated function. So, it depends on your needs. ^^ – Jon Jan 15 '13 at 08:31
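For illustration, here is a minimal sketch of the "quick-and-dirty" style of `.docx` reading discussed in the comments, in Python rather than PHP. It is not the code from either linked page. It relies only on the fact that a `.docx` file is a ZIP package whose main text lives in `word/document.xml`; the function name `docx_to_text` is hypothetical, and headers, footers and footnotes are ignored.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(data):
    """Extract paragraph text from the bytes of a .docx file."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

The extracted text is what would be searched or indexed; the uploaded file itself stays untouched. Legacy `.doc` files use a different binary format and would need a separate converter.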

2 Answers


One option is a hybrid: index the résumés in a database, but store a file-system path as each résumé's location. When you get a hit in the database and want to retrieve the résumé, read it from the file system via the path recorded in the database.
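A rough sketch of this hybrid, assuming the text has already been extracted from each upload. SQLite stands in for MySQL so that the example is self-contained, and the table layout and the names `open_index`, `add_resume` and `find_resumes` are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_index(db_path=":memory:"):
    """Open the database and create the index table if it is missing."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resumes ("
        " id INTEGER PRIMARY KEY,"
        " applicant TEXT NOT NULL,"
        " file_path TEXT NOT NULL,"   # where the original upload lives on disk
        " body TEXT NOT NULL)"        # extracted text, used only for searching
    )
    return conn


def add_resume(conn, storage_dir, applicant, filename, data, text):
    """Write the original file to disk and index its extracted text."""
    # Real code should generate its own file name instead of trusting the upload's.
    path = Path(storage_dir) / filename
    path.write_bytes(data)
    conn.execute(
        "INSERT INTO resumes (applicant, file_path, body) VALUES (?, ?, ?)",
        (applicant, str(path), text),
    )
    conn.commit()
    return path


def find_resumes(conn, keyword):
    """Return paths of stored files whose text mentions the keyword."""
    rows = conn.execute(
        "SELECT file_path FROM resumes WHERE body LIKE ? ORDER BY id",
        ("%" + keyword + "%",),
    )
    return [Path(row[0]) for row in rows]
```

A substring `LIKE` scan like this is only a placeholder for the search step: it reads every row and treats `%` and `_` in the keyword as wildcards, so a proper full-text index would replace it as the collection grows. The point of the sketch is the split: searchable text in the database, original bytes on disk.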

DWright
  • Thanks DWright. Your point about indexing the résumés in a DB is something I missed in my initial thinking, and it probably goes with Mario's solution below. – nigs Jan 15 '13 at 07:59
  • Cool! Glad that helped a bit. – DWright Jan 15 '13 at 08:02

What you want is a full-text index of the documents. This tends to be a job for a search server such as Solr (see this cross-reference on Stack Overflow: "How do I index documents in Solr"). The database would keep a reference to the file on disk. You should not store blob data in an InnoDB table unless it uses the Barracuda file format with `ROW_FORMAT=DYNAMIC`. Please refer to the MySQL Performance Blog for further details on blob storage in InnoDB.
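To make the idea of a full-text index concrete, here is a toy inverted index in Python. It is not Solr and does not use Solr's API; it only illustrates the underlying data structure, a map from each term to the set of documents containing it, keyed here by file path. The class name `TinyFullTextIndex` and the tokenization rule are assumptions made for this sketch.

```python
import re
from collections import defaultdict


class TinyFullTextIndex:
    """Toy inverted index: term -> set of file paths containing that term."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _terms(text):
        # Lowercase and keep '+' and '#' so skills like "c++" or "c#" survive.
        return set(re.findall(r"[a-z0-9+#]+", text.lower()))

    def add(self, file_path, text):
        """Index the extracted text of one document under its file path."""
        for term in self._terms(text):
            self._postings[term].add(file_path)

    def search(self, query):
        """Return paths of documents containing every term in the query."""
        terms = self._terms(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            docs = self._postings.get(term, set())
            result = set(docs) if result is None else result & docs
        return result
```

A lookup touches only the postings for the query terms instead of scanning every document, which is why this kind of index stays fast as the number of résumés grows. A real search server adds stemming, ranking, persistence and document-format parsing on top of the same idea.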

Mario Mueller
  • Thanks Mario. We are looking into the Solr option and will update you all with the results. – nigs Jan 15 '13 at 07:58