Read PDF and Word DOC Files Using PHP

One of my customers has an enormous number of PDF and Microsoft Word DOC files on their website. These documents are core to their online services, so it’s not as though they’re garbage files sitting on the server. The customer wanted their website’s search engine (Sphider) to index the contents of these PDF and DOC files so that their clients could find the documents they needed without clicking through a bunch of summary pages. I got it working, so let me show you how to read PDF and DOC files using PHP.

Reading PDF Files

To read PDF files, you will need to install the Xpdf package, which includes the pdftotext command-line tool. Once pdftotext is installed, getting the PDF text takes a single PHP statement that shells out to it and captures its output.
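The original snippet is not reproduced here, so what follows is only a minimal sketch of the idea, not necessarily the exact statement used on the customer's site. It assumes a Linux server, that pdftotext lives at /usr/bin/pdftotext, and that PHP is allowed to call shell_exec(); the function names pdf_text_command() and read_pdf_text() are made up for illustration.

```php
<?php
// Assumed install location; adjust to wherever pdftotext lives on your server.
const PDFTOTEXT_BIN = '/usr/bin/pdftotext';

/**
 * Build the shell command that converts one PDF to plain text.
 * (Hypothetical helper, named for this example only.)
 */
function pdf_text_command(string $pdfPath, string $binary = PDFTOTEXT_BIN): string
{
    // escapeshellarg() protects against spaces and shell metacharacters
    // in file names. The trailing "-" tells pdftotext to write the text
    // to standard output instead of creating a .txt file.
    return escapeshellarg($binary) . ' ' . escapeshellarg($pdfPath) . ' -';
}

/**
 * Return the text of a PDF, or an empty string if the command produced
 * no output (for example, because pdftotext is missing).
 */
function read_pdf_text(string $pdfPath): string
{
    $output = shell_exec(pdf_text_command($pdfPath));
    return is_string($output) ? $output : '';
}
```

Calling read_pdf_text() with the path to a PDF returns a string you can hand to the search indexer. Escaping the path matters as soon as file names come from uploads or a directory scan rather than from your own code.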

Reading DOC Files

As with PDF files, you’ll need to install another package, this one called Antiword. Once it is installed, grabbing the Word DOC content is again a matter of calling the command-line tool from PHP and capturing its output.
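Again, the original snippet is missing, so this is a hedged sketch rather than the code that actually ran. It assumes a Linux server with antiword installed at /usr/bin/antiword and shell_exec() enabled; doc_text_command() and read_doc_text() are hypothetical names chosen for this example.

```php
<?php
// Assumed install location; adjust to wherever antiword lives on your server.
const ANTIWORD_BIN = '/usr/bin/antiword';

/**
 * Build the shell command that dumps one DOC file as plain text.
 * (Hypothetical helper, named for this example only.)
 */
function doc_text_command(string $docPath, string $binary = ANTIWORD_BIN): string
{
    // antiword prints the document text to standard output by default,
    // so no extra arguments are needed beyond the (escaped) file name.
    return escapeshellarg($binary) . ' ' . escapeshellarg($docPath);
}

/**
 * Return the text of a DOC file, or an empty string if the command
 * produced no output (for example, because antiword is missing).
 */
function read_doc_text(string $docPath): string
{
    $output = shell_exec(doc_text_command($docPath));
    return is_string($output) ? $output : '';
}
```

Calling read_doc_text() with the path to a DOC file gives you the plain text, ready for indexing in the same way as the PDF text.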

Antiword does NOT read DOCX files, and it deliberately does not preserve formatting. There are other libraries that will preserve formatting, but in our case we just want to get at the text.

One thought on “Read PDF and Word DOC Files Using PHP”

  1. Karol, I can’t thank you enough. You saved my Alfresco / Share installation. I was so fed up with the pain and downtime that pdf files going through pdfbox were causing that I planned on finding replacement software for Alfresco and Share. I was going on 3 hours of downtime for this go-round when I found your post. The pdfbox that came with the Alfresco installation had been replaced with a current version long ago, after attempting to access an Alfresco Share site with a pdf file in it would immediately bring down java and tomcat. However, the newer versions still didn’t perform well enough to be used. We have a lot of architectural, mechanical and HVAC, and electrical building drawings and plans. On upload, these files would choke pdfbox, the Share uploader would report that the upload failed, and the whole system would be dead slow or even inaccessible for sometimes an hour or more while java/pdfbox chewed the processor. After using this fix, I crossed my fingers and restarted Alfresco and Share. It started normally and I could see pdftotext spawning, doing its thing and closing. Yeah, it worked! For information: Ubuntu server 10.04, Alfresco 3.4 d. I used apt to install xpdf-utils, created a webapps/alfresco/WEB-INF/bin/ directory with a symlink to the /usr/bin/pdftotext binary, and used your xml file as instructed. I did have to modify line 38 (my file was pdftotext rather than pdftotext-linux). I really appreciate your effort. I’m also glad that I stumbled here because the pdf stamper that you’ve made looks like something that we might be interested in.

