Monday, November 4, 2013

Read a PDF file with OCR

So far, the previous examples showed how to read a string or any other kind of data from an image. Now let's see how to read a PDF file using the OCR class methods.

To start off, create a simple PDF file with some sample text in it (as many pages as you want), or simply download any PDF file. In my example I created a PDF file with 3 pages, with the text below on each page in turn:

Page 01 
Page 02 
Page 03 

The code below reads the PDF file and prints the text found on each page.

import java.awt.image.BufferedImage;
import java.io.File;

import com.asprise.util.ocr.OCR;
import com.asprise.util.pdf.PDFReader;

public class Test2 {
    public static void main(String[] args) throws Exception {

        // Create a new OCR object
        OCR ocr = new OCR();

        // Create a PDFReader object for the PDF file at the given location
        PDFReader reader = new PDFReader(new File("D:\\1.pdf"));

        // Open the PDF file
        reader.open();

        // Store the number of pages in the PDF file in an int variable
        int pages = reader.getNumberOfPages();

        // Print the number of pages in the PDF file
        System.out.println("Number of pages are " + pages);

        // Render each page as an image, run OCR on it and print the result
        for (int i = 0; i < pages; i++) {
            BufferedImage image = reader.getPageAsImage(i);
            System.out.println("OCR result:\n" + ocr.recognizeEverything(image));
        }

        // Close the PDF file
        reader.close();
    }
}
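
The example hard-codes the file location, so a wrong path only shows up once the reader tries to open it. Below is a minimal sketch of checking the path first, using only java.io.File from the standard library. The class name PdfPathCheck and the method describeProblem are made up for this illustration and are not part of the Asprise library.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical helper, not part of the Asprise library.
class PdfPathCheck {

    // Returns null when the file looks usable, otherwise a short
    // description of what is wrong with it.
    static String describeProblem(File file) {
        if (!file.isFile()) {
            return "not an existing file: " + file.getPath();
        }
        if (!file.canRead()) {
            return "file is not readable: " + file.getPath();
        }
        // Assumption: the PDF is expected to carry a .pdf extension.
        if (!file.getName().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            return "file does not have a .pdf extension: " + file.getPath();
        }
        return null;
    }

    public static void main(String[] args) {
        // Use the first command-line argument if given, else the path from the example.
        String path = args.length > 0 ? args[0] : "D:\\1.pdf";
        String problem = describeProblem(new File(path));
        if (problem == null) {
            System.out.println("OK: " + path);
        } else {
            System.out.println("Cannot use PDF, " + problem);
        }
    }
}
```

If describeProblem returns null, the same File object can be passed on to the PDFReader constructor as in the example above.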


The output of this program is:

Number of pages are 3

OCR result:

^UNLICENSED VERSlON FOR EVALUATlON PURPOSE ONLY. Asprise Java PDF Libray - http:llasp
Page 01

OCR result:
^UNLICENSED VERSlON FOR EVALUATlON PURPOSE ONLY. Asprise Java PDF Libray - http:llasp
Page OZ

OCR result:
^UNLICENSED VERSlON FOR EVALUATlON PURPOSE ONLY. Asprise Java PDF Libray - http:llasp
Page 03
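
In the output above, every page starts with an evaluation banner line before the actual page text. Below is a minimal sketch of filtering that line out with plain string handling. It assumes the banner always begins with "^UNLICENSED", as it does in this sample; the class name OcrOutputCleaner and the method stripBanner are made up for this illustration and are not part of the Asprise library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Asprise library.
class OcrOutputCleaner {

    // Assumption: the banner line always starts with this marker,
    // as in the sample output.
    private static final String BANNER_PREFIX = "^UNLICENSED";

    // Returns the OCR text with banner lines and blank lines removed.
    static String stripBanner(String ocrText) {
        List<String> kept = new ArrayList<>();
        for (String line : ocrText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith(BANNER_PREFIX)) {
                continue; // skip blank lines and the banner
            }
            kept.add(trimmed);
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String sample = "^UNLICENSED VERSlON FOR EVALUATlON PURPOSE ONLY. "
                + "Asprise Java PDF Libray - http:llasp\nPage 01\n";
        System.out.println(stripBanner(sample)); // prints "Page 01"
    }
}
```

In the loop from the example, the result of ocr.recognizeEverything(image) could be passed through stripBanner before printing. This only removes the banner; it does not correct misread characters such as "Page OZ" on the second page.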

