cfo.com

Print this article | Return to Article | Return to CFO.com

Spreadsheets: Moving Data from PDF to Excel

Converting a PDF "data dump" into Excel or Access is made easier with a third-party tool.
Bill Jelen, CFO.com | US
November 17, 2010

Editor's Note: To read more about spreadsheets, share a tip with other readers, or suggest a topic for Bill Jelen to cover in an upcoming column, click here to visit CFO.com's Spreadsheet Tips page.

Reader Ray S. wins a copy of Excel for the CEO for this week's question: "I am presented with a download of information in a PDF or Adobe format. This 'data dump' needs to be converted into either Excel or Access so as to be analyzed more easily. Can you provide best practices to accomplish this?"

Excel 2007 and Excel 2010 both support sending your Excel files to PDF. A constant frustration is Excel's inability to later get that data back into Excel. Here is a simple demonstration of the problem:

Figure 1 shows a simple table in Excel. In Excel 2010, use File, Save & Send, Create PDF/XPS Document as shown in Figure 2. The table will be accurately rendered in the PDF document as shown in Figure 3.

 

Fig. 1

MrExcel12Fig1

 

Fig. 2

MrExcel12Fig2

 

Fig. 3

MrExcel12Fig3

 

Select the text in the PDF document by using Edit, Select All, or by dragging the mouse. Next, copy from the PDF document using Edit, Copy, or Ctrl+C. Then switch to Excel. Paste the copied PDF data to Excel and you will find that the original table is now unwound into a relatively useless single column as shown in Figure 4 (below, left). How can it be that the Microsoft Excel team cannot
round-trip a simple table from Excel to an Excel-created PDF and back to Excel? Is this simply the Microsoft Excel team being obstinate?

Fig. 4

MrExcel12Fig4

Fails with PDF; Works with XPS
Notice that the Create PDF/XPS command actually offers to create either the market-dominant PDF format or the upstart XPS format. XPS is the new format designed by Microsoft to compete with the PDF format. An unscientific search of Google indicates that PDF has a 98.5% market share compared with 1.5% for XPS. The Excel team smartly makes "PDF" the default choice. However, if you would instead save the file as XPS, copy the table from XPS, and paste back to Excel, you will see that the table retains its original shape, font color, and numeric formatting. Could it be that the Excel team is trying to boost the popularity of XPS instead of PDF?

One solution, then, is to start asking for your data in XPS instead of PDF. The downside is that 98.5% of your recipients won't have a clue how to produce XPS files.

The Microsoft Word Team Can Get It Right
If this truly is a case of Microsoft purposely doing a bad job dealing with Adobe PDF, the animosity is apparently not spread throughout the entire Microsoft Office team. If you copy the data from PDF and paste it to Word, the table appears in a row-and-column format, although the final zero in the grand total wraps in that cell of the table, as shown in Figure 5 (below).

Fig. 5

MrExcel12Fig5

A seemingly ridiculous solution, then, is to copy the PDF, paste to Word, copy from Word, and paste to Excel. The final zero goes back into the grand total cell. The result shown in Figure 6 (below) is passable.

Fig. 6

MrExcel12Fig6

Why Can't Excel Get This Right?
For answers, I turned to Keith Bradbury of InvestInTech.com. Keith's company is the inventor of Able2Extract, a program that can take PDF files and turn them back into Excel workbooks, Access tables, Word documents, and more. InvestInTech is the expert in the field of getting data from PDF to Excel or Access; I've been using its software to painlessly extract Excel data from PDF files since the early 2000s.

Keith's team looked at my examples. They pointed out that each Copy (or Ctrl+C) in any program sends several formats of data to the Windows Clipboard such that the receiving program can choose which to use. For example, copy some data from WordPad, and the application sends six versions of that data to the clipboard (text, RTF, UTF8, and so on).

When you copy from Adobe, the program sends two versions of the data, one in plain text and one in rich text format. Keith's team can only guess, but apparently, Microsoft Word is reading the rich text format from the clipboard and Excel is reading the plain text, with less-than-stellar results.


Use a Third-Party Tool
Let's be honest. I think a lot of people convert their Excel files to PDF because they don't want the recipient to be able to reuse the data in Excel. Send me a 450 page document in PDF and they figure I won't have the time or inclination to take each page from PDF to Word and back to Excel. I actually had this exact problem once and converted the whole document back to an Excel database using Able2Extract. You can test out Able2Extract 7 with its free seven-day trial.

The software is simple to use. Open the PDF file with Able2Extract. The first step is to identify which area(s) you want to import. If the document contains titles or page numbers in the header, you could choose to crop the page to ignore those sections of the document. Able2Extract offers to convert the data to Excel, Word, PowerPoint, Publisher, HTML, or AutoCad, as shown in Figure 7.

The result is shown in Figure 8. Able2Extract preserved the original font colors and numeric formatting. The process was not perfect; the font size changed from 11 point to 9 point and the right-justified headings moved back to the left. However, once I had the data in a table in Excel, it is easy enough to fix those issues.

Able2Extract also offers a pro version that will use OCR to recognize characters in PDF documents that have been created using a scanner. Download a free trial from http://www.investintech.com/.

Best Practice?
It really should not be this hard to get data out of a PDF file and back into Excel. If you have a one-page table, the PDF-to-Word-to-Excel solution will work suitably well. If you have a several-page document with many different tables or repeating headers, then going to a third-party solution such as Able2Extract makes sense.

Fig. 7

MrExcel12Fig7

 

Fig. 8

MrExcel12Fig8

 

Bill Jelen, a CFO contributing editor, runs MrExcel.com and is the author of 32 books about Microsoft Excel, including Charts & Graphs for Excel 2010. Post questions for future columns in the "Suggest a Topic" section of the Spreadsheet Community Center on the right.




CFO Publishing Corporation 2009. All rights reserved.