
Semantic PDF

I enjoy archiving media, both as a hobby and as a serious pursuit. I mainly focus on saving books, course notes, and old documentaries. Beyond archiving for its own sake, it serves a practical purpose: it gives you on-demand access to academic material, and on several occasions notes I had saved later disappeared from university web pages. Saving the papers themselves introduces a new issue: sorting and searching through them. For a small number of documents, manually sorting them into categories and giving them descriptive file names is a viable option. But as the number of files grows, remembering file names and locations becomes unreasonable. This is compounded by the fact that the lines between academic topics are not clear, which makes manual sorting inherently difficult.

One possible solution is to make the contents of the documents themselves searchable. This can be done by directly matching keywords in a search… but what if you can’t remember the exact terms you’d like to search for and can only describe the concept? (Which can be a real problem at times!) Instead of looking for keywords directly, a more sophisticated approach is to use semantic similarity to determine which terms are close to each other in meaning, which helps relevant material rank highly in the search results even when it lacks the exact words used in a query.

Implementation

After a lot of brainstorming, there were some key features that I desired:

  • Automatic recognition of new or deleted files
  • The ability to deal with duplicate files in multiple locations
  • Choosing the number of search results that appear
  • Showing the score of search results
  • Searching files semantically, so one can query using similar terms
  • Storing semantic data so everything is preprocessed before searching

The basic outline of the process for running a semantic search is as follows:

\[\text{scan for PDFs}\rightarrow\text{convert PDFs to text}\rightarrow\text{run word2vec}\rightarrow\text{make dictionary}\]

Finding PDFs is easy: a working directory set in config.yaml is used by search.py, which walks the given folder and hunts for .pdf files. This produces a list of paths to use in the next step. The pages of new PDFs found during the scan are converted to images and passed through pytesseract, whose optical character recognition (OCR) turns them into raw text. The converted text is then filtered for stop words, weeding out words that carry no meaningful semantic information.
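
To make the scan-and-convert step concrete, here is a minimal sketch of how it could look. It is not the actual contents of search.py: the helper names, the `working_directory` key in config.yaml, and the use of pdf2image to render pages for pytesseract are all assumptions.

```python
import os
import yaml                               # pip install pyyaml
import pytesseract                        # pip install pytesseract (needs Tesseract)
from pdf2image import convert_from_path   # pip install pdf2image (needs poppler)

def find_pdfs(config_path="config.yaml"):
    """Walk the working directory named in config.yaml and collect .pdf paths."""
    with open(config_path) as f:
        root = yaml.safe_load(f)["working_directory"]  # assumed config key
    pdf_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(dirpath, name))
    return pdf_paths

def pdf_to_text(pdf_path):
    """Render each page to an image and run Tesseract OCR on it."""
    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```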

The cleaned text is fed to word2vec, and cosine similarity over the resulting vectors determines how related words are across the scanned documents. This step, and the text conversion from the previous step, are only run when new documents are found, in order to save time. The data generated by word2vec is then added to a dictionary, along with some other important information about the documents.
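
For reference, here is a minimal sketch of this step using gensim’s Word2Vec. The post only says “word2vec,” so gensim (version 4+) and the parameters below are assumptions; `similarity` is exactly the cosine similarity between the learned vectors.

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

def train_word2vec(cleaned_texts):
    """cleaned_texts: dict mapping file path -> OCR'd, stop-word-filtered text."""
    corpus = [simple_preprocess(text) for text in cleaned_texts.values()]
    return Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=2)

# Toy usage with a fake two-document corpus.
model = train_word2vec({
    "a.pdf": "rings fields ideals modules " * 50,
    "b.pdf": "groups rings homomorphisms kernels " * 50,
})
print(model.wv.similarity("rings", "ideals"))  # cosine similarity of the two vectors
```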

Deciding what to include in the dictionary was challenging, because its structure determines how feasible many of the goals I set out with are. I settled on a few key pieces of information about my PDFs that should be included:

  • File paths
  • Word2vec data
  • A hash created from the file data itself

The last item in this list is extremely important: it lets us identify duplicates of the same file even if they have different names or are stored in different locations, so duplicates can be weeded out and the user can be shown every place a file is found in the search results. When the program first processes the text files, the key used is the file path. It is better to work with file hashes instead, so the dictionary is inverted (switching the keys from paths to hashes) before the user searches it. A query can then return the file path along with the score.
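
A rough sketch of how the hashing and inversion could be done; the key names and schema here are illustrative, not the project’s actual dictionary layout.

```python
import hashlib

def file_hash(path, chunk_size=1 << 20):
    """Hash the raw file bytes so renamed or moved copies collapse to one key."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def invert_index(path_index):
    """Turn {path: tokens} into {hash: {"paths": [...], "tokens": [...]}},
    grouping duplicate files found in different locations under one entry."""
    inverted = {}
    for path, tokens in path_index.items():
        key = file_hash(path)
        entry = inverted.setdefault(key, {"paths": [], "tokens": tokens})
        entry["paths"].append(path)
    return inverted
```

Keying on a content hash also means a file that merely moves on disk does not need to be re-OCR’d; only its stored path list changes.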

Using Semantic PDF

Semantic PDF may be downloaded here along with detailed instructions for use. As a demonstration of basic use, I downloaded Keith Conrad’s expository papers and ran a few searches on them. Processing them did not take as long as I expected (roughly 40 minutes for 246 documents), although with a much larger corpus it would certainly be slow. Here I have entered the command python semantic_pdf/cli.py search ring to look for any documents containing information about rings. (Note that the program is initializing, as this is the first time it has been run.)

[Image: output of the first `ring` search, including initialization]

If I wanted to search for multiple terms, I could replace ring in the previous command with something like "rings fields algebra". By default, the top 3 matches are returned; this can be changed to the top N matches by adding the argument -n N to the query. And the results yielded by the query in the image above are not bad!

[Image: top search results for the query, with scores]

The first document is clearly about rings, and the next two results are about Noetherian rings!
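
For the curious, here is one way a top-N ranking with scores could be computed from word2vec similarities. This is a guess at the general approach, not Semantic PDF’s actual scoring code, and it assumes the illustrative hash-keyed index sketched earlier.

```python
import heapq

def score_document(model, query_terms, doc_tokens):
    """Average the best word2vec (cosine) similarity each query term
    achieves against any token in the document."""
    scores = []
    for term in query_terms:
        if term not in model.wv:
            continue
        sims = [model.wv.similarity(term, tok) for tok in doc_tokens if tok in model.wv]
        if sims:
            scores.append(max(sims))
    return sum(scores) / len(scores) if scores else 0.0

def search(model, index, query, n=3):
    """index maps file hash -> {"paths": [...], "tokens": [...]}; returns
    the top n (score, paths) pairs, mirroring the -n N option."""
    terms = query.lower().split()
    ranked = ((score_document(model, terms, entry["tokens"]), entry["paths"])
              for entry in index.values())
    return heapq.nlargest(n, ranked, key=lambda pair: pair[0])
```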

Future Plans

There are some deficiencies with the current version of this project that I want to resolve in later versions.

  1. Processing documents with OCR is slow, which could be resolved by working in parallel.
  2. Removing stop words destroys useful search terms like the word “p-adic,” which shows up frequently.
  3. Images, equations, scanned book covers, and scanned old books with paper that isn’t uniform in color generate a lot of junk text.

I have some ideas on how these problems could be dealt with, but I still want to think of better solutions before modifying my code. The first issue would be the easiest to resolve by running the same process on more cores, though that also requires a better computer. Issue two might be dealt with by using a table of exceptions to stop words; however, manually creating such a table would be extremely tedious, and I have not found a table online that fits my needs.
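
As a rough illustration of both ideas: the OCR step could be fanned out across cores with the standard library’s multiprocessing, and the exception table could start as a small hand-maintained set. The `pdf_to_text` helper is the assumed one from the earlier sketch, and the exception list shown is only an example.

```python
from multiprocessing import Pool

def ocr_all(pdf_paths, workers=4):
    """Run pdf_to_text (sketched earlier) on several files in parallel."""
    with Pool(processes=workers) as pool:
        texts = pool.map(pdf_to_text, pdf_paths)
    return dict(zip(pdf_paths, texts))

# Hand-maintained exceptions that should survive stop-word removal (issue 2).
KEEP = {"p-adic"}

def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t in KEEP or t not in stop_words]
```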

The third issue is the hardest to deal with, but I think it’s also the most interesting problem I currently face. One would expect the so-called ‘junk text’ to have characteristics that do not match typical English writing: far more spaces than letters, or vice versa. My next project will be to analyze documents and determine some statistics on English writing: the ratio of spaces to letters, how often punctuation appears, and how spacing is affected by removing stop words. Automating this statistical process will be important, because the program as currently set up works for other languages that use the Latin alphabet, and I would not want to lose the ability to process them by focusing only on English.
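
As a sketch of the kind of heuristic I have in mind, the statistics could be gathered per document and compared against thresholds. The threshold values below are placeholders, not numbers derived from real measurements of English text.

```python
import string

def text_stats(text):
    """Simple per-document statistics: space-to-letter ratio and punctuation share."""
    letters = sum(ch.isalpha() for ch in text)
    spaces = text.count(" ")
    punct = sum(ch in string.punctuation for ch in text)
    total = max(len(text), 1)
    return {
        "space_to_letter": spaces / max(letters, 1),
        "punct_fraction": punct / total,
    }

def looks_like_junk(text, max_space_ratio=0.5, max_punct_fraction=0.2):
    """Flag text whose statistics fall far outside ordinary Latin-alphabet prose.
    The cutoffs are illustrative placeholders."""
    stats = text_stats(text)
    return (stats["space_to_letter"] > max_space_ratio
            or stats["punct_fraction"] > max_punct_fraction)
```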

Lastly, it would be interesting to add in OCR that can read equations. It’s not something I consider essential for Semantic PDF, but it’s a real possibility.

This post is licensed under CC BY 4.0 by the author.