PDF To Text Python – Extract Text From PDF Documents Using PyPDF2 Module

Welcome to my new post, PDF To Text Python. Here you will learn how to extract text from PDF files using Python. Python has several modules for extracting text from PDFs, and this tutorial uses PyPDF2. So let’s get started.

PDF To Text Python – How To Extract Text From PDF

Before getting to the main topic of this post, I will explain a use case where this kind of PDF extraction is required.

  • One example is a job portal where people upload their CVs in PDF format. Suppose a recruiter searches for keywords such as Hadoop developer, big data developer, Python developer or Java developer. Those keywords have to be matched against the skills you listed in your resume. To do that, the portal extracts the text from your PDF document, matches it against the recruiter’s keywords, and returns your name, email and other details for each match.
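As a rough illustration of the matching step in this use case, here is a minimal sketch that uses only the standard library. The function name match_keywords and the sample resume text are made up for this example; a real job portal would pass in the text extracted from the uploaded PDF and would likely do something more sophisticated.

```python
import re


def match_keywords(resume_text, keywords):
    """Return the keywords that appear in the text (case-insensitive)."""
    found = []
    for keyword in keywords:
        # \b keeps "java" from matching inside "JavaScript"
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, resume_text, flags=re.IGNORECASE):
            found.append(keyword)
    return found


# Hypothetical text, standing in for what was extracted from a CV
sample_text = "Skills: Python, Hadoop, Big Data pipelines, JavaScript"
matches = match_keywords(sample_text, ["python", "java", "hadoop", "big data"])
print(matches)  # ['python', 'hadoop', 'big data']
```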

Python provides many modules for PDF extraction, but here we will use the PyPDF2 module. So let’s see how to extract text from a PDF using this module.

PDF To Text Python – Extracting Text Using PyPDF2 Module

  • PyPDF2 is a Pure-Python library built as a PDF toolkit. It is capable of:

    • extracting document information (title, author, …)
    • splitting documents page by page
    • merging documents page by page
    • cropping pages
    • merging multiple pages into a single page
    • encrypting and decrypting PDF files
    • and more!

So now we will see how to extract text from a PDF using the PyPDF2 module. Follow the steps below in your Python IDE.

Installing PyPDF2

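Run pip install PyPDF2 in a terminal to install the library. Note that the names used in this post (PdfFileReader, numPages, getPage() and extractText()) belong to PyPDF2 versions before 3.0. Version 3.0 removed them in favor of PdfReader, len(reader.pages), reader.pages[i] and extract_text(), so to follow this post as written, install an older release with pip install "PyPDF2<3.0".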

Importing PyPDF2

Now you have to import the PyPDF2 module. Put the statement import PyPDF2 at the top of your script.

Creating a PDF File Object

To create a PDF file object, open your file for reading with Python’s built-in open() function. Pass it the file name (including the path if the file is in another folder) and the mode rb (r for read, b for binary). A PDF is a binary file, so it has to be opened in binary mode.
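To see what binary mode gives you, here is a small sketch that needs no PDF library. It writes a few bytes that look like the start of a PDF into a temporary file (a stand-in for a real document, not a valid PDF) and reads them back with rb. The names sample_path and pdfFileObj are chosen only for this illustration.

```python
import os
import tempfile

# Create a temporary file that begins like a PDF does.
# This is only a stand-in; it is not a valid PDF document.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    sample_path = tmp.name

# Open in rb mode: r for read, b for binary
pdfFileObj = open(sample_path, "rb")
header = pdfFileObj.read(5)  # binary mode returns bytes, not str
pdfFileObj.close()
os.remove(sample_path)  # clean up the temporary file

print(header)  # b'%PDF-'
```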

Creating A PDF Reader Object

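Now create an object of the PdfFileReader class from the PyPDF2 module and pass it the PDF file object that holds the open file, for example pdfReader = PyPDF2.PdfFileReader(pdfFileObj), where pdfFileObj is the file object from the previous step.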

Printing Number Of Pages In PDF File

You can also ask for the number of pages in your file. Just use pdfReader.numPages, which returns the total number of pages in the PDF file.
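The steps so far (open the file, create the reader, read numPages) can be sketched as one small function. This assumes a PyPDF2 release older than 3.0, where PdfFileReader and numPages are available; the function name count_pages and the file name in the usage comment are hypothetical.

```python
def count_pages(pdf_path):
    """Return the number of pages in the PDF at pdf_path.

    Assumes PyPDF2 < 3.0 (PdfFileReader and numPages).
    """
    # Third-party import, placed here so it is only needed when called
    import PyPDF2

    with open(pdf_path, "rb") as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        return pdfReader.numPages


# Usage, with a hypothetical file name:
# print(count_pages("example.pdf"))
```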

Creating A Page Object

Now you will read the content of a particular page. Call the getPage() method on the pdfReader object, pass it the page number, and store the page object it returns, for example with pdfReader.getPage(0). Page numbers start from 0, so 0 refers to the first page of the PDF file.
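Because the index passed to getPage() starts at 0 while people count pages from 1, an off-by-one mistake is easy to make. Here is a tiny helper for the conversion; to_page_index is a name invented for this sketch and is not part of PyPDF2.

```python
def to_page_index(page_number, num_pages):
    """Convert a 1-based page number to the 0-based index getPage() expects."""
    if not 1 <= page_number <= num_pages:
        raise ValueError(f"page {page_number} is outside 1..{num_pages}")
    return page_number - 1


print(to_page_index(1, 10))   # 0, the first page
print(to_page_index(10, 10))  # 9, the last page of a 10-page file
```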

Extracting Text From Page

The extractText() method of the page object extracts the text of that page and returns it as a string. In this example, it extracts the text of the first page of the PDF.
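A minimal sketch of this step, again assuming a PyPDF2 release older than 3.0. The function name extract_page_text is hypothetical. The text is extracted while the file is still open, because the reader reads from the file object when a page is requested.

```python
def extract_page_text(pdf_path, page_index=0):
    """Return the text of one page (0-based index) of the PDF at pdf_path.

    Assumes PyPDF2 < 3.0 (PdfFileReader, getPage and extractText).
    """
    # Third-party import, placed here so it is only needed when called
    import PyPDF2

    with open(pdf_path, "rb") as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(page_index)
        # Extract before the with block ends and the file is closed
        return pageObj.extractText()


# Usage, with a hypothetical file name:
# print(extract_page_text("example.pdf", 0))
```

If the PDF consists of scanned images with no text layer, expect an empty or nearly empty string, since there is no text to extract.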

Closing The PDF File Object

Finally, close the file object by calling its close() method.
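An alternative to calling close() yourself is a with block, which closes the file automatically even if an error occurs inside it. The sketch below shows this with a temporary stand-in file (not a valid PDF); the names are chosen only for this illustration.

```python
import os
import tempfile

# Temporary stand-in file that begins like a PDF
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    sample_path = tmp.name

with open(sample_path, "rb") as pdfFileObj:
    first_bytes = pdfFileObj.read(4)
    print(pdfFileObj.closed)  # False, still open inside the block

print(pdfFileObj.closed)  # True, closed when the block ended
os.remove(sample_path)  # clean up the temporary file
```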

PDF To Text Python Using PyPDF2 Complete Code 

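Putting the steps together, the complete script imports PyPDF2, opens the file in rb mode, creates a PdfFileReader object, prints numPages, gets the first page with getPage(0), prints the result of extractText(), and closes the file.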
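Here is one way the whole flow could look, written as a function instead of a flat script. This is a sketch, not the original listing: the function name pdf_to_text and the file name in the usage comment are hypothetical. It uses the names from this post when only the older API is present, and switches to PdfReader, pages and extract_text() when the installed PyPDF2 offers them.

```python
def pdf_to_text(pdf_path, page_index=0):
    """Return (number_of_pages, text_of_one_page) for the PDF at pdf_path."""
    # Third-party import, placed here so it is only needed when called
    import PyPDF2

    with open(pdf_path, "rb") as pdfFileObj:  # closed automatically
        if hasattr(PyPDF2, "PdfReader"):
            # Newer releases: PdfReader, .pages and extract_text()
            pdfReader = PyPDF2.PdfReader(pdfFileObj)
            num_pages = len(pdfReader.pages)
            text = pdfReader.pages[page_index].extract_text()
        else:
            # Older releases: the names used in this post
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            num_pages = pdfReader.numPages
            text = pdfReader.getPage(page_index).extractText()
    return num_pages, text


# Usage, with a hypothetical file name:
# num_pages, text = pdf_to_text("example.pdf")
# print(num_pages)
# print(text)
```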

Now let’s check its output. Running the script prints the number of pages in the file, followed by the text extracted from the first page.

We have completed this successfully, so now it’s time to wrap up this post.

So guys, this was all about the PDF To Text Python tutorial. I hope it was helpful for you; if so, please share it with your friends. If you have any problem with this post or run into difficulties while coding, feel free to ask your questions in the comment section. And keep checking Simplified Python’s posts. HAPPY CODING.
