We recently deployed Hedgedoc v1 as a replacement for our aging Etherpad. During this process we encountered two major problems:

  • Hedgedoc does not support a global history.
  • The Pad search does not work with -.

While solving the first two problems we also encountered a third problem:

  • Having over 1000 Notes available for a user made Hedgedoc (both frontend and backend) slow.

In this post I will describe how we solved all these problems. I will also go into some of the interesting technical details of our code.

General remark

The solution described here is not the sensible solution to the problems we faced. The section "The sensible thing" contains a sketch of a more reasonable solution. Rather, we took this problem as an opportunity to learn new things. Our solution nonetheless contains some interesting technical details.

What is Hedgedoc?

Hedgedoc, like Etherpad, is a web-based collaborative real-time editor. This means that you can edit documents together with others, live in your browser (think Google Docs). The main difference between Hedgedoc and Etherpad is that Hedgedoc supports Markdown. Etherpad only supports text and basic formatting. Hedgedoc, on the other hand, supports Markdown with all the bells and whistles. You can have a look at the features note for an exhaustive list of all features.

Implementing a global history

A list of all Notes or an explore page are being discussed for the rewrite of Hedgedoc that is currently in progress. But it seems to me that the rewrite won’t be complete for some time. So we decided to take matters into our own hands.

A global history can be patched in with relatively few lines of code. When a user is logged in, a list of the Notes that this user has visited previously can be retrieved at /history. These Notes are then displayed by the UI. We only have to change this API endpoint to return all Notes instead.


diff --git a/lib/history.js b/lib/history.js
index e0c16da5..ae47b380 100644
--- a/lib/history.js
+++ b/lib/history.js
@@ -11,6 +11,7 @@ const errors = require('./errors')
 // public
 const History = {
   historyGet,
+  historyGetAll,
   historyPost,
   historyDelete,
   updateHistory
@@ -63,6 +64,33 @@ function getHistory (userid, callback) {
   })
 }

+function getAllHistory (callback) {
+  models.Note.findAll()
+    .then(function (notes) {
+      const out = []
+
+      const getId = function (note) {
+        if (note.alias) {
+          return note.alias
+        } else {
+          return models.Note.encodeNoteId(note.id)
+        }
+      }
+
+      notes.forEach(note => out.push({
+        id: getId(note),
+        text: note.title,
+        time: note.updatedAt.getTime(),
+        tags: models.Note.parseNoteInfo(note.content).tags
+      }))
+      logger.info(`read history success: ${out}`)
+      return callback(null, { history: out })
+    }).catch(function (err) {
+      logger.error('read history failed: ' + err)
+      return callback(err, null)
+    })
+}
+
 function setHistory (userid, history, callback) {
   models.User.update({
     history: JSON.stringify(parseHistoryToArray(history))
@@ -132,6 +160,18 @@ function historyGet (req, res) {
   }
 }

+function historyGetAll (req, res) {
+  if (req.isAuthenticated()) {
+    getAllHistory(function (err, history) {
+      if (err) return errors.errorInternalError(res)
+      if (!history) return errors.errorNotFound(res)
+      res.send(history)
+    })
+  } else {
+    return errors.errorForbidden(res)
+  }
+}
+
 function historyPost (req, res) {
   if (req.isAuthenticated()) {
     const noteId = req.params.noteId

diff --git a/lib/web/historyRouter.js b/lib/web/historyRouter.js
index fa426bbb..97b0c3ef 100644
--- a/lib/web/historyRouter.js
+++ b/lib/web/historyRouter.js
@@ -7,7 +7,9 @@ const history = require('../history')
 const historyRouter = module.exports = Router()

 // get history
-historyRouter.get('/history', history.historyGet)
+historyRouter.get('/history', history.historyGetAll)
 // post history
 historyRouter.post('/history', urlencodedParser, history.historyPost)
 // post history by note i

These two small patches are enough to satisfy our functionality requirements. But as you see the performance is lacking. The backend takes about 8 seconds to load the history and the frontend takes a few more seconds to be interactive.

[Figure: Response time of the patched history endpoint]

This means that there are now two major problems to solve:

  1. the long response times of the backend
  2. the slow frontend

The cause

The Markdown of a Note has to be parsed to extract the Note’s tags from the frontmatter. Hedgedoc extracts the tags from a Note only when the history is retrieved. Because our history now contains all 1315 Notes, every Note is parsed on every request to the history endpoint. The extraction of the tags and the associated parsing of each Note’s Markdown is what causes the long response times of the /history endpoint.

The sensible thing

The sensible thing would be to cache the tags of a Note in a database field. The database field is updated when the Note’s content is updated. This technique is already being used for the title of a Note, which also has to be derived from the content of the Note.

The tags are a comma-separated string in the Note’s frontmatter. A very simple implementation could cache that comma-separated string as it appears in the frontmatter.

  • Adapt the Note model to include an additional field tags.
  • Update the value of the tags field when a Note is saved. This could be done together with the title in finishUpdateNote in lib/realtime.js.

function finishUpdateNote (note, _note, callback) {
  if (!note || !note.server) return callback(null, null)
  const body = note.server.document
  const title = note.title = models.Note.parseNoteTitle(body)
  const values = {
    title,
    content: body,
    authorship: note.authorship,
    lastchangeuserId: note.lastchangeuser,
    lastchangeAt: Date.now()
  }
  _note.update(values).then(function (_note) {
    saverSleep = false
    return callback(null, _note)
  }).catch(function (err) {
    logger.error(err)
    return callback(err, null)
  })
}
  • Extract the tags from the tags field when retrieving the history. This could be done in getAllHistory in lib/history.js.

diff --git a/lib/history.js b/lib/history.js
index ae47b380..86a7ce88 100644
--- a/lib/history.js
+++ b/lib/history.js
@@ -81,7 +81,7 @@ function getAllHistory (callback) {
         id: getId(note),
         text: note.title,
         time: note.updatedAt.getTime(),
-        tags: models.Note.parseNoteInfo(note.content).tags
+        tags: note.tags.split(",").map(tag => tag.trim())
       }))
       logger.info(`read history success: ${out}`)
       return callback(null, { history: out })
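
To make the data format concrete, here is a minimal Python sketch of how the comma-separated tags string could be pulled out of a Note’s frontmatter and split into a list. It is not Hedgedoc’s parser: the function names extract_tags_field and split_tags are hypothetical, and the sketch assumes the frontmatter is delimited by lines of three dashes and that tags are written on a single line as `tags: a, b, c` (YAML list syntax is not handled).

```python
import re

# Matches a leading frontmatter block delimited by lines of three dashes.
FRONTMATTER = re.compile(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", re.DOTALL)


def extract_tags_field(content):
    """Return the raw comma-separated tags string, or "" if there is none.

    Assumption: tags are written on one line as "tags: a, b, c".
    """
    match = FRONTMATTER.match(content)
    if match is None:
        return ""
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "tags":
            return value.strip()
    return ""


def split_tags(raw):
    """Split and trim like the diff above, additionally dropping empty entries."""
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


note = "---\ntitle: Demo\ntags: meeting, 2021-05, infra\n---\n# Demo\n"
print(split_tags(extract_tags_field(note)))
```

The raw string returned by extract_tags_field is what would be stored in the cached field; the splitting only happens when the history is assembled.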

Our solution

To solve the problems we created

  • a new frontend for the listing page in Svelte
  • a new backend that provides the required data via a REST API with Django and django-rest-framework

graph TD
    hedgedoc[Hedgedoc]
    db[DB]
    hedgedoc_api[Django Listing Backend]
    hedgedoc_ui[Svelte Listing Frontend]
    nginx[Reverse Proxy]
    user[User]
    hedgedoc <--> db
    hedgedoc_api <--> db
    nginx <--> hedgedoc
    nginx <--> hedgedoc_api
    nginx <--> hedgedoc_ui
    hedgedoc_ui <-.->|REST| hedgedoc_api
    user <--> nginx

We overlay the original Hedgedoc, the new listing backend and the listing frontend using a reverse proxy. The listing frontend retrieves the data from the backend using a REST API. The listing backend uses the same database as the Hedgedoc instance. It reads the Notes from the database and also saves the cached tags there.

To store our additional information we introduce some new models. The new models only have relations to Notes. Other Hedgedoc models have been omitted for simplicity.

  • A Tag has a name and stores the Notes that have this Tag using an n:m relation. This is the model that caches the tags.
  • The NoteExtension model stores the date when the tags of a Note were last updated.

graph TD
    Note[Note]
    NoteExtension[NoteExtension]
    Tag[Tag]
    NoteExtension -->|1:1| Note
    Tag <-->|n:m| Note

With these models we cache the tags of the Notes. When the tags are read, we first check whether the Note has been modified since the tags were last calculated. If the Note has been modified, we update the tags. Then we return the tags.
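
The check described above can be sketched without any framework. This is only an illustration under stated assumptions, not the code of our backend: CachedNote, extract_tags and parse_count are hypothetical names, and the "content" is reduced to a plain comma-separated tag string so that the expensive parsing step has a simple stand-in.

```python
from dataclasses import dataclass, field
from datetime import datetime


def extract_tags(raw):
    # Hypothetical stand-in for the expensive parsing step; here the
    # "content" is simply a comma-separated tag string.
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


@dataclass
class CachedNote:
    content: str
    updated_at: datetime                        # set whenever the note is saved
    tags_last_updated: datetime = datetime.min  # "never calculated"
    cached_tags: list = field(default_factory=list)
    parse_count: int = 0                        # only here to make the effect visible

    def tags(self):
        # Recalculate only if the note changed after the last calculation.
        if self.updated_at > self.tags_last_updated:
            self.cached_tags = extract_tags(self.content)
            self.tags_last_updated = self.updated_at
            self.parse_count += 1
        return self.cached_tags
```

Calling tags() repeatedly on an unchanged note parses the content once; only a newer updated_at triggers another parse.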

The code for the backend and frontend is publicly available.

Some technical details

Using Hedgedoc’s Database in Django

Remember that Hedgedoc and the listing backend access the same database. This means that we have to read the data created by Hedgedoc’s ORM in Django. Adapting Django to read the data created by Hedgedoc is surprisingly easy.

  • The names of the Model’s fields have to match the names of the columns in the database.
  • Add a Meta class to your Model and set managed to False and db_table to the name of the table in the database. The managed option indicates whether Django should modify the structure of the table. If managed is False, Django will not create, update or delete the table, e.g. during migrations. Whether data can be written to the table is unaffected by this option.

from django.db import models


class Session(models.Model):
    sid = models.CharField(primary_key=True, max_length=36, editable=False)
    expires = models.DateTimeField(editable=False)
    data = models.TextField(editable=False)
    createdAt = models.DateTimeField(editable=False)
    updatedAt = models.DateTimeField(editable=False)

    class Meta:
        db_table = "Sessions"
        managed = False

Assume that Hedgedoc’s database is configured in Django under the alias hedgedoc. Then creating the models is already enough to be able to interact with the data. We just have to indicate which database Django should use with the using method of the queryset.


# Retrieve all sessions
Session.objects.using("hedgedoc").all()

Access to the database can be made more comfortable with database routers. A database router decides four things:

  1. from which database a model should be read
  2. to which database a model should be written
  3. whether a relation between two specific objects should be allowed
  4. whether a migration should be applied to a database

With a database router we can automatically direct read and write operations to the correct database. We could also ensure that no migrations, or only the required ones, are applied to Hedgedoc’s database. This way Django’s default tables will never be created in Hedgedoc’s database.
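
A router covering these four decisions could look roughly like the following sketch. The four method names are the hooks Django calls on a router; everything else is an assumption made for illustration: the class name HedgedocRouter, the app label "listing" for the app that holds the Hedgedoc-backed models, and the database alias "hedgedoc".

```python
class HedgedocRouter:
    """Route all models of one app to the Hedgedoc database."""

    # Assumptions: both names are hypothetical and must match your project.
    app_label = "listing"
    db_alias = "hedgedoc"

    def db_for_read(self, model, **hints):
        if model._meta.app_label == self.app_label:
            return self.db_alias
        return None  # no opinion, Django falls back to the default database

    def db_for_write(self, model, **hints):
        if model._meta.app_label == self.app_label:
            return self.db_alias
        return None

    def allow_relation(self, obj1, obj2, **hints):
        # Explicitly allow relations between two objects of the routed app.
        if {obj1._meta.app_label, obj2._meta.app_label} == {self.app_label}:
            return True
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label == self.app_label:
            # Migrations of the routed app run only on the Hedgedoc database.
            return db == self.db_alias
        if db == self.db_alias:
            # Keep every other app (auth, sessions, ...) out of Hedgedoc's database.
            return False
        return None
```

The router is activated by listing its dotted path in the DATABASE_ROUTERS setting. With it in place, the explicit using("hedgedoc") calls become unnecessary for the routed models. Models with managed = False are still skipped by migrations regardless of what allow_migrate returns.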

Keeping the Tag cache up-to-date

Out of the options we considered, we decided on updating the tags on read (4). We only found out later that Hedgedoc already caches the title of a Note on write (3). Had we known that at the time of the decision, we would probably have chosen option 3.

#  | Option                                                                                 | Advantage                  | Disadvantage
1  | Calculate tags on demand                                                               | up-to-date                 | performance impact too high
2  | Update tags after some time (e.g. 5 min)                                               | -                          | tags may be outdated
3  | Update tags when notes are saved                                                       | up-to-date, very efficient | requires modifications in Hedgedoc
4  | Update tags on read if the Note has been changed after the tags have been calculated  | up-to-date, efficient      | -

Option 4 can be implemented with a custom getter for the tags field on the Notes model, e.g. using @property. The getter then behaves as follows:

  1. Check if the Note has been modified after the tags have last been updated. Update the tags if this is the case.
  2. Return the tags using the backing field.

    @property
    def tags(self):
        if self.updatedAt > self.noteextension.tags_last_updated:
            self.update_tags()

        return self.tag_set

django-rest-framework and @property

Managed attributes can be used with django-rest-framework by using a custom serializer that contains the attributes.

Performance Tip

These queries span 3 tables (Notes, Tags and NoteExtensions). Take care to preload the related data with prefetch_related. Otherwise the performance will degrade very quickly.