Scraping PDFs¶
A nice feature of Evernote is that it scrapes PDFs you upload in order to make
them searchable. Magpie has that ability as well, if you use the magpie util
pdf_scraper.py
.
Configuration¶
pdf_scraper.py
has a config file, like magpie and email_notes.py
, and it
lives in the same place as the other config files. It only has two options to
configure:
- repo: This is (presumably) the same value you specified in the magpie config.
- default_notebook: If you run
pdf_scraper.py
on a PDF outside of your magpie repo, the plaintext output will be written to this notebook inside the magpie repo, rather than wherever the PDF lives
Scraping¶
Using pdf_scraper.py
should be fairly straightforward. After configuration,
simply run pdf_scraper.py /path/to/pdf1.pdf /another/path/pdf2.pdf etc.pdf
.
The scraper will run against each of the files passed as command line arguments.
If the PDFs were inside the configured magpie repo, then the output files will
be stored in the same location as the original PDF, and then name will be
identical, except the filename will have a leading period. For example, if the
PDF was /path/to/file.pdf
then the plaintext output from the scraper would
be stored in /path/to/.file.pdf
. If the PDF is not already in the configured
repo, then the file will still start with ”.”, and will be saved to the
default_notebook from the config file.
Ugly Output¶
The odds are pretty good that the output of pdf_scraper.py
will be ugly.
That’s the best I could do for now. The purpose of this functionality is
primarily to allow for searching the PDFs, not necessarily to read their
contents in the web application. However, once the plaintext version exists in
magpie, you can edit it in the web application just like any other note, and it
will not impact the PDF. This means if you want to make the note readable and
clean it up, you can.