PDF redaction – AstraZeneca EU contracts – s**t happens

Excerpt: As the AstraZeneca vaccine contract debacle makes clear, redacting a PDF involves more than just the page content; other objects have to be checked as well.


About the author: Michael Karbe is Managing Director of Actino Software and has been a member of the PDF Association since its founding in 2006. His company supports organizations and institutions in digitizing …

February 9, 2021
by Michael Karbe


What happened?

Example of the AstraZeneca contract redaction

The text passages on the PDF page were redacted correctly, but the PDF’s bookmarks referring to the redacted content were overlooked. The confidential information was removed from the page as intended, yet it remained readable in the PDF bookmarks!

See the Spiegel Online article (in German) from January 29, 2021: “This is what the redacted passages of the AstraZeneca contract say. The EU Commission apparently made a serious blunder when it published the vaccine contract with AstraZeneca: Parts of redacted passages are legible – and contain some interesting information.”

It is always serious when confidential information becomes public or falls into the wrong hands. Redacting the page correctly, as was done in this case, does not release you from the obligation to review the other objects in the PDF.

Features to look for in PDF redaction software

Comprehensive PDF redaction covers the following areas:

  1. document and object properties
    1. Title, subject, author, keywords
    2. XMP document metadata
    3. Object metadata, especially for images
  2. navigation elements
    1. Bookmarks (as in the AstraZeneca case!)
    2. Links and references
    3. Accessibility tags
  3. content
    1. OCR-generated text, including white text on a white background
    2. Image layers with hidden negative shapes
    3. Attachments and embedded files
    4. JavaScript
  4. private data
    1. Hidden content that is ignored by the reader software
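
As a rough illustration of the checklist above, the following Python sketch scans the raw bytes of a PDF for name tokens that usually indicate one of these areas. It is a crude first-pass heuristic under several assumptions, not a substitute for redaction software: the `MARKERS` table and the function name `scan_pdf_bytes` are made up for this example, the table is not exhaustive, only Flate-compressed streams are inflated, and encrypted files or other filter chains will hide content from it. A hit only means "this area exists and needs review", not that it contains redacted text.

```python
import re
import zlib

# Illustrative mapping from PDF name tokens to the checklist areas above.
# Assumption: this table is a starting point, not a complete list.
MARKERS = {
    b"/Info": "document information (title, subject, author, keywords)",
    b"/Metadata": "XMP metadata stream",
    b"/Outlines": "bookmarks",
    b"/Annots": "annotations, including links",
    b"/StructTreeRoot": "accessibility tags",
    b"/EmbeddedFiles": "attachments and embedded files",
    b"/JavaScript": "JavaScript actions",
    b"/OCProperties": "optional content (layers)",
    b"/PieceInfo": "private application data",
}

# Heuristic: grab everything between "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def scan_pdf_bytes(data: bytes) -> dict[str, str]:
    """Return {marker: description} for every marker found in the file."""
    chunks = [data]
    for match in STREAM_RE.finditer(data):
        try:
            # Object streams are often Flate-compressed; inflate them so the
            # markers inside become visible. Trailing bytes are tolerated.
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not Flate data (or encrypted); skipped by this sketch

    found = {}
    for name, description in MARKERS.items():
        # Require a non-alphanumeric character after the name so that
        # "/Info" does not match a longer name such as "/Information".
        pattern = re.compile(re.escape(name) + rb"(?![A-Za-z0-9])")
        if any(pattern.search(chunk) for chunk in chunks):
            found[name.decode("ascii")] = description
    return found
```

To try it on a file, read the bytes first, for example `scan_pdf_bytes(open("redacted.pdf", "rb").read())`, where `redacted.pdf` is a placeholder name. Every entry in the result points to an area that should be inspected before the document is published.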

Emergency procedure

If you lack appropriate redaction software, you can print the redacted PDF, scan the printout and run text recognition over the new PDF document before distribution.

Advantages

  • Because the PDF file is entirely new, it carries no traces of the source document’s creator.
  • Any “invisible” content is gone.

Disadvantages

  • File size can quickly increase by a factor of 4, occupying storage, hindering fast downloads and slowing downstream use.
  • Text and images are reduced in quality.
  • Navigation elements and links must be recreated.
  • Tags for accessibility must be recreated.
  • Digital image credits and other copyright information are missing.

There are a variety of solutions for content redaction; see the redaction category in the PDF Association’s Product Index. In addition to anonymization, pseudonymization may also be required, in which case each redaction is marked by codes indicating the purpose of the redaction.
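
To make the idea of purpose codes concrete, here is a minimal Python sketch of pseudonymization on plain text. It is only an illustration under stated assumptions: it works on already extracted text rather than on PDF objects, the function name `pseudonymize`, the purpose labels and the `[LABEL-001]` code format are invented for this example, and the sample names are fictional.

```python
import re


def pseudonymize(text: str, terms: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Replace each sensitive phrase with a code that names its purpose.

    `terms` maps a sensitive phrase to a purpose label (hypothetical labels
    such as "PERSON" or "PRICE"). Returns the rewritten text and a lookup
    table from code to original phrase, which must be stored separately
    and kept confidential.
    """
    if not terms:
        return text, {}

    # Longest phrases first, so a short phrase cannot split a longer one.
    phrases = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in phrases))
    codes: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        phrase = match.group(0)
        if phrase not in codes:
            # Codes are numbered in order of first appearance in the text.
            codes[phrase] = f"[{terms[phrase]}-{len(codes) + 1:03d}]"
        return codes[phrase]

    # A single pass avoids re-matching text inside codes already inserted.
    result = pattern.sub(replace, text)
    return result, {code: phrase for phrase, code in codes.items()}


redacted, lookup = pseudonymize(
    "Acme Corp pays Jane Doe EUR 100.",
    {"Jane Doe": "PERSON", "EUR 100": "PRICE"},
)
print(redacted)  # Acme Corp pays [PERSON-001] [PRICE-002].
```

The same phrase always receives the same code, so readers can still follow references through the document without learning the underlying value.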