SQL Schema
The purpose of this page is to be clear about how Aytwit stores data. SQL is a really nerdy version of spreadsheets like in Excel or Google Docs. The acronym stands for Structured Query Language, but is most commonly used to describe a type of relational database that like 99% of the Internet uses. The schemas below are written in SQL and describe tables (the "spreadsheets") that Aytwit uses to store all data. You'll hopefully note the naming conventions for columns in these tables, which indicate the cryptographic measures applied to keep personal data safe from hackers and even Aytwit itself. See detailed specifications below. I dare you to find any privacy leaks beyond the trivial and/or information-theoretically necessary ones already noted.

Thought Table Verification Table Notification Table Session Table Event Table Optout Table Subscriber Table Donor Table
Thought Table
  This table is the pulsing brain of Thoughter, where all thoughts wander around
  hoping for the chance that they'll soon meet another thought just like them.
  - Below you'll find references to TLVs. For Thoughter these are 2 bytes for
    the data "type", 2 bytes for its length, and then an arbitrary number
    of bytes up to the maximum (an unsigned "short") that represents some
    scalar value, usually a number or string. For example a length of time
    in milliseconds or an email address. Thoughter specifically could probably
    get by with one byte each for the type and length, but it's pushing the
    limits. For example a valid email address has a maximum size of 254, which
    is one away from the maximum value that one byte can represent. Thoughter
    may also support other forms of "identification" that are longer than 255
    in the future. Easier to support larger sizes now rather than patch things
  - Using AES-128 and not AES-192 or AES-256 mostly for performance reasons.
    Besides, if a weakness is found in AES-128 it would most likely apply to
    AES-256 as well, and in the worst case if a quantum computer is finally
    made that is powerful enough to crack AES-128 but not AES-256, then
    I can slowly migrate to the latter, safe in the assurance that Thoughter
    probably wouldn't be a juicy enough target to point the first quantum
    computer at within the time it would take to cycle out all AES-128
    ciphertexts (1 year, which is the current maximum thought duration).

CREATE TABLE thought (
    (1) Upon a person pushing a thought, generate secure-random 32 bytes.
    (2) SHA-512 to 64 bytes.
    - The original unhashed token is Base64-ed and sent in a hyperlink to the
      Patron and they use this to confirm their address.
    - The original unhashed token is used to encrypt
    - Using a hash of the token, rather than the token itself, makes it so
      this table could literally be made read-only and public, but there's no
      reasonable way to call Thoughter with the original token since you would
      have to brute force reverse the hashed token here. Well, you could just
      blindly try any random 32 bytes and hope it hashes to any row in this
      table, but I'm pretty sure that would take until "explosion of the sun"
      AD until a collision were found. Unless Thoughter goes galactic...

  thought_token__hashed BINARY(64) NOT NULL PRIMARY KEY,
    (1) Take sender address plus receiver address plus the message's hashtag
        and create a TLV (type-length-value).
    (2) SHA-512 to 64 bytes.
    (3) (step removed for now, see below) BCRYPT to 60 bytes with work factor
        of 10. Same 16-byte salt is used for all hashing and is stored on
        application server.
    (4) AES-128 encrypt with master password, salt, and IV (initialization
        vector, which is kind of "salty" too) stored on application server
        (AKA pepper, defense in depth) which padds out to next multiple of 16,
        which is 80 since we're not doing Bcrypt, at least for now. Bcrypt
        (as mentioned elsewhere also) would bring the un-peppered hash down
        to 60 bytes, making the next multiple equal to 64 bytes.
    - We could avoid the extra AES-128 encryption here and just hash the TLV
      with a random global salt. It might be easier on the CPU. However it's
      no more secure. If the salt is leaked it's the same as the encryption
      key getting leaked. There's a potential benefit to having the ability to
      rotate keys in the event of a leak, or as a good passive defensive
      measure, which you can't do with one-way hashing. Well, besides adding
      more and more onion-skin-hashing but that's not sustainable.
    - Typical AES usage involves "rounding up" with random padding to the next
      multiple of 16 bytes even if the plaintext (here the SHA-512 hash) is
      already a multiple of 16. If we use BCRYPT again in the future, then the
      column size could go back down to 64 bytes. See
      `receipt.receipt_hash__peppered` where BCRYPT is used and so 60 bytes
      turns into 64 bytes.
    - BCRYPT step has been removed for initial version. It can be added back
      easily in the future if needed, but for now the extra CPU cost doesn't
      seem worth it because I can't think of any real "attack" that BCRYPT
      prevents over just SHA512. If someone was literally trying to brute
      force one individual thought for whatever reason, I suppose it would
      help there, but I can't think of what a good reason would be. Note
      further that BCRYPT *is* being kept for `receipt.receipt_hash__peppered`
      (which is a similar column to this one) because it's more of a "why
      not?" situation. See further details on that column.
    - Using the same salt for all BCRYPT hashing is fine since this column
      must be unique anyway. Rainbow table attacks are thus not possible at
      any given time. That said, thoughts are deleted from this table on
      thought match or expiration, so an attacker *could* be snapshotting the
      database and thus seeing identical hashes due to the same salt being
      used, but they would only be able to tell them that the same person sent
      the same thought to another same person X times. It wouldn't be really
      useful for anything compared to traditional password cracking where
      duplicate hashes can be leveraged.
    - It's questionable whether BCRYPT really offers any further protection in
      the first place over SHA-512. If an attacker is able to decrypt this
      column by getting access to the encryption key on the application
      server, the only real attack they could pull off would be to see if a
      specific person sent a specific message to another specific person.
      BCRYPT wouldn't really slow that attack down.
    - AES-128 salt and IV are environment variables so they don't have to be
      stored alongside like some other encrypted columns in this database.
      They can't be randomly generated per thought and prepended onto this
      column (or stored in another column) because the SQL engine must be able
      to match on it.
    - For more hashtags per message in the future, repeat steps for each
      hashtag in the message, so a message with 2 hashtags would generate two

  thashes__peppered BINARY(80) NOT NULL,
    (1) Take sender address plus receiver address plus the message plus hint
        plus payment information and create a TLV (type-length-value).
    (2) AES-128 encrypt TLV using unhashed `thought_token__hashed` as the key
        secret material, with 24 random bytes for salt and IV (latter
    - Max size is calculated by salt_and_iv(8+16) + sender_tlv(2+2+254) +
      receiver_tlv(2+2+254) + message_tlv(2+2+64)) + hintType_tlv(2+2+64)) +
      durationCharge_tlv(2+2+8) + paymentInfo_tlv(2+2+128)) +
      eventId_tlv(2+2+8) + thoughtPushTime_tlv(2+2+8) +
      socialProof_tlv(2+2+16) + thoughtDuration_tlv(2+2+8) +
      authType_tlv(2+2+16)*2 + emailForHint_tlv(2+2+254) + wiggle(16).
    - Actually, scratch that size calculation. It got too muddled so
      I bumped it up to a power of two. Who knows there might be some SQL
      engine optimization that can be made with power of two sized columns.
    - The last numbers in the TLVs above represent the maximum size in bytes
      of the given piece of data, not necessarily the actual size.
    - The last 16 is wiggle room for AES-128 which requires padding to a
      multiple of 16. Could probably be +15 or something technically.
    - Message is constrained to ASCII characters so string length equals byte
    - The amount of money to be charged is kept in the TLV, instead of figured
      out on thought match, just in case prices have to increase for whatever
      reason. Storing the cost here locks in the amount to charge.
    - Email spec used here defines maximum length as 254 with only ASCII
      characters (i.e. one byte each), which seems mostly agreed upon online.

  thought_private__token_encrypted VARBINARY(1024) NOT NULL,
    (1) Starts out NULL when a thought is initially pushed, indicating the
        patron's address has not been confirmed.
    (2) Upon address confirmation, take message plus payment information from
        thought_private__token_encrypted` and create a TLV
    (3) AES-128 encrypt using sender+receiver+hashtag TLV as the key's secret
        material, with 24 random bytes prepended for salt and IV.
    (4) AES-128 encrypt with key only accessible by app server (AKA pepper,
        defense in depth), again using 24 random bytes (prepended) for salt
        and IV.
    - Max size is calculated by salt_and_iv(8+16)*2 + message_tlv(2+2+64))*2 +
      durationCharge_tlv(2+2+8) + paymentInfo_tlv(2+2+128)) +
      eventId_tlv(2+2+8) + thoughtPushTime_tlv(2+2+8) +
      emailAddyForNotification_tlv(2+2+254) + wiggle(16).
    - Like with `thought_private__token_encrypted`, scratch that size
      calculation. Rounding up to a reasonable power of two since it
      looks cool.
    - See note in `thought_private__token_encrypted` for why the event id
      is in this TLV.
    - Max size last tested at 631. Bumping a bit for insurance.

  thought_shared__thashes_encrypted__peppered VARBINARY(768) DEFAULT NULL,
    (1) First inserted as the epoch time in milliseconds when the thought
        is first push and proof/activation is required. For now 30 minutes
        after the initial thought push is the timeout.
    (2) Then becomes the epoch time of when the thought itself will expire
        after it is activated.
        (3) Then it turns into some time (as of now an hour) past thought match
    - An automated cron job-type clean up process should clean this row up
          whenever this expiration time is reached.

  thought_expiration_time BIGINT(20) NOT NULL
Verification Table
  Thoughter sends a verification token to a patron's email. When they click on
  the link containing this verification token, we can look up the hashed token
  in this table, then use the unhashed token to decrypt the thought token and
  activate the thought in the `thought` table. Simple right?!?!

  A row is deleted immediately after look up, so that nobody going back through
  your email (even your email provider!) can access your thought by clicking
  on the link again.

CREATE TABLE verification (
    Same deal as `thought.thought_token__hashed`. The original unhashed
    verification token is sent to the patron's email.

  verification_token__hashed BINARY(64) NOT NULL PRIMARY KEY,
    For now this just stores `thought.thought_token` but may in future store
    more state, which is then used to activate the thought.

  thought_info__token_encrypted VARBINARY(1024) NOT NULL,
    You get the idea.

  verification_expiration_time BIGINT(20) NOT NULL
Notification Table
  Stores all the information necessary for a patron to get a notification
  through email when their thought gets matched. We could send all the
  information directly to their email address, but that would be a possible
  privacy/security/anonymity hole. So instead they must click the linkt

CREATE TABLE notification (
    Same deal as `thought.thought_token__hashed`. The original unhashed
    notification token is sent to the patron's email as a clickable link.

  notification_token__hashed BINARY(64) NOT NULL PRIMARY KEY,
    Stores the matched thought message, who the match is from, and payment
    information. Tested max size at 599. Bumping a bit for insurance.

  notification_info__token_encrypted VARBINARY(650) NOT NULL,
    You get the idea.

  notification_expiration_time BIGINT(20) NOT NULL
Session Table
  Used to remember a particular identity from a given 3rd party identity
  provider so a patron doesn't have to repeatedly prove themselves for

    Same deal as `thought.thought_token__hashed`. The original unhashed
    session token is stored as an HTTP cookie on the patron's machine.

  session_token__hashed BINARY(64) NOT NULL PRIMARY KEY,
    (1) Take identity and auth type and a few other details and  create a
        TLV (type-length-value).
    (2) AES-128 encrypt TLV using unhashed `session_token__hashed` as the key
        secret material, with 24 random bytes for salt and IV (latter
    - Max size is calculated by salt_and_iv(8+16) + identity_tlv(2+2+254) +
      authType_tlv(2+2+16) + socialProof_tlv(2+2+16)) + wiggle(16).
    - Bumped max size to the next power of two for possible SQL engine
      optimizations, and insurance wiggle room in the event of miscalculation
      or adding something and forgetting to increase the size.

  session_info__token_encrypted VARBINARY(1024) NOT NULL,
    The thought token used to originally create the session.
    Used to prevent duplicate session rows created from a given thought.
    In practice this case will be caught by the fact that most people aren't
    trying to hack the system. But if you were trying to do that, you could
    e.g. push a thought, click remember me, delete your browser cookies,
    then press remember me again, etc. So this is kind of a mitigation against
    blowing up the number of rows in this table. There are always ways of
    blowing it up but they're more tedious.

  thought_token__hashed BINARY(64) NOT NULL UNIQUE,
    (1) Set to one year in the future after session start.
    (2) Bumped another year on certain interactions like pushing a thought and
        checking a thought.
    (3) Therefore used to clean up old sessions that haven't been touched in a

  session_expiration_time BIGINT(20) NOT NULL
Event Table
  Tracks anonymous patronage metrics by logging data at significant events.
  The idea is for this data to be freely available and of no particular value
  to anyone except to satisfy curiosity and help improve the service. Queries
  on this table can tell you all sorts of things like how many thoughts are
  matched within a given time period, match percentages with and without
  hints, how long typical matches take, broad service error diagnostics. All
  kinds of cool things! The table is meant to be ever-growing and periodically
  archived, again eventually publically downloadable as well.

  If made public, the rows might have to be shuffled in case by default they
  are stored/served from the SQL database in the order they were written,
  since this could let both parties identify each others' events if traffic is
  low enough.
  Eventually, if the service is popular enough and standard hashtags start to
  get used, it would be nice to log the hashtag for each event. There would
  have to be a whitelist of loggable hashtags in order to avoid logging
  hashtags that themselves contain sensitive information, like if two people
  use Thoughter with a prearranged secret "password"-type hashtag. And a
  hashtag would probably only be logged for one party, not both, to make
  association between two parties impossible or really hard.

  Below you will find repeated reference to a "patronage funnel". This means
  the progression of possible actions a Patron can cause or have happen to
  them using the service. They are enumerated below in `event_action`.

  A current weakness with this table is that, if the data is made public,
  either intentionally or through leaks, then it may be possible for two
  Patrons with a thought match to see what thought duration the other Patron
  chose. I'm too tired to describe it fully, but it can be done through
  deanonymization techniques using `event_timing`. The latter becomes less of
  an issue as the service becomes more popular. But I leave further details as
  an exercise to the reader.

  There are several fixes that could work individually or together:
  (1) Simply don't publicly release any data (intentionally).
  (2) Obfuscate the data before publicly releasing.
  (3) Get rid of the leak completely once this `event` table has enough data
      that we don't really need to track anything related to thought duration.
  (4) Simply don't care about the weakness since it's not a huge deal.

  These fixes all have their various tradeoffs, including decreased accuracy,
  consistency, and completeness of the event tracking.

    An event_id, AKA big random 8-byte integer, is generated and encrypted
    into TLVs alongside other data in various tables like `thought`. It's then
    copied here for every step in the patronage funnel enumerated in
    `event_action`. It will thus be the same for every action for a given
    Patron's experience through the funnel.

  event_id BIGINT(20) NOT NULL,
    The Patron actions (or reactions) we're interested in. There are a few
    different combinations of flows through the funnel that can happen. The
    actions are listed as chronologically as possible though.
    For various reasons, all the `OPTOUT__THOUGHTER__*` actions are 1-funnel long with
    their own `event_id` of zero.

  event_action ENUM (

    A psuedo-enumeration of application result codes for the various actions
    that a Patron can trigger. So this will be some kind of "success" status
    hopefully most of the time, and otherwise indicate the overall type of
    logical failure in the app. `event_status_info` may provide more
    information, especially useful for exceptional failure situations.
    TODO: This should be a proper SQL enumeration once the dust settles.

  event_status VARCHAR(128) NOT NULL,
    For now simply provides a way to add extra diagnostic information if
    `event_status` indicates some kind of failure, especially a rare one. More
    precisely, for now at least, this is the class name of the Java Exception
    that was returned or thrown by a piece of application logic. Either that
    or "NA" (not applicable).
    There may be more stuff thrown in here in the future, but it of course
    should never contain any personally identifiable information.
    TODO: Make nullable.

  event_status_info VARCHAR(255) NOT NULL,
    For event_action.PUSH_THOUGHT this is an epoch timestamp, rounded to the
    nearest hour as additional defense against deanonymization. Otherwise it
    is the epoch time delta since the previous event in the funnel. This is
    a tricky column as far as not leaking any unwanted information when this
    table is made public. For example *just* using timestamps for each event
    in a funnel would be simpler, but wouldn't allow us to do the rounding
    obfuscation without losing considerable accuracy in the time deltas.
    Also as a minor bonus I think storing time deltas makes queries easier to
    write in general, since application code has already done some of the
    "pre-processing". Some specific queries are harder with deltas of course.
    In the future some or all of the deltas might be rounded to the nearest
    second or minute or hour or *something* to further make any kind of
    deanonymizing link between an individual Patron and their event funnel
    harder. For now I'm keeping it exact to the millisecond because there's
    too much value for improving the service so early after its release. And
    anyway it becomes less of a concern as more people use the service and the
    noise in the data makes it harder to identify yourself and the other
    Patron by timing analysis.
    Either way all this will be considered more as the service matures and
    before this table is actually made public, since proofing metadata against
    deanonymizing tactics is a tricky problem.

  event_interval_timing BIGINT(20) NOT NULL,
    The duration in nanoseconds that the logic behind the event took to
    execute. Basically measures how long our SQL queries, emails, and crypto
    are taking on a per-event basis. This column is so far "unique" in that
    it represents nanoseconds, unlike other time-related columns in the
    database as of this writing, which all represent milliseconds.

  event_logic_timing BIGINT(20) NOT NULL,

    An arbitrary integer value that is event-specific. For example for
        `event_action.PUSH_THOUGHT` this will be the thought duration in
    TODO: Make nullable. Requires some Ikwyt improvements.

  event_integer_value BIGINT(20) NOT NULL,
    An arbitrary integer value that is event-specific. For example for
        `event_action.SEND_HINT` this will be the hint type.

  event_string_value VARCHAR(255) DEFAULT NULL
Optout Table
  The data in this table helps prevent the sending of unsolicited emails to
  Patrons who have chosen to ignore either (a) *all* address confirmation and
  hint emails, or (b) hint emails from another specific Patron. For now
  a Patron has to receive at least one unsolicited email then click a link
  in that email that ultimately adds a row to this table. So TODO is perhaps
  have a page where Patrons can *proactively* block specific people. Another
  TODO is to have an automated way to reverse it. For now relying on people
  writing to email@aytwit.com and asking to reverse the optout, which will
  be done manually until it becomes a headache.

    Same basic treatment as `thought.thought_token__hashed`.
    - The original unhashed token is Base64-ed and sent in a hyperlink to
      the Patron and they use this to anonymously activate the row by matching
      hashes with this column.

  optout_token__hashed BINARY(64) NOT NULL PRIMARY KEY,
    This column allows us to recover the original
    optout token so rows in this table are reusable, because the rows'
    crypto costs a fair amount of CPU time. An alternate scheme could be used
    that doesn't reuse rows, but that opens the possibility of DoS-type
    attacks that could flood this table.
    (1) Upon an optoutable email about to be delivered, generate secure-random
        32 bytes, same token as what's hashed into `optout_token__hashed`.
    (2) AES-128 encrypt using from/to email TLV as the key's secret material,
        with 24 random bytes prepended for salt and IV.
    (3) AES-128 encrypt with key only accessible by app server (AKA pepper,
        defense in depth), again using 24 random bytes (prepended) for salt
        and IV.
    (4) NULL the value for this column in a given row when an optout link from
        an email is clicked.
    - Basically same as `thought_mine_partial__thashes_encrypted__peppered`
      in the `thought` table. See more notes there.
    - When this column is set to NULL it means that the given row is
      "activated" and will prevent emails from being sent to the sender
      address hashed into `optout_hash__peppered`.

  optout_token__ashes_encrypted__peppered BINARY(120) DEFAULT NULL,
    Same basic crypto/privacy treatment as `thought.thashes__peppered`.
    This column stores a peppered hash of either an individual's email
    address, or a from/to combination. The former means the Patron in question
    does not want any email from aytwit.com, period. Not even email address
    confirmation emails. The latter is for when a Patron Alice doesn't want
    hint emails from another specific Patron Bob, even though Alice will
    probably usually not know that it's Bob, unless the hint email itself
    tells her, or she can guess.

  optout_ashes__peppered BINARY(80) NOT NULL UNIQUE,
    The time when the row was inserted, then the time this row becomes active
    when `optout_token__thashes_encrypted__peppered` is set to NULL.
    There will be periodic automated cleanup of this table, DELETEing any rows
    where `optout_token__thashes_encrypted__peppered` is not NULL and it has
    been more than n days since the timestamp given here. Probably a month or
    something, but the cleanup process doesn't even exist yet.
    NOTE that the initial time saved here is randomly offset by 16 days from
    the true time that the write was performed. This is to make it extremely
    unlikely to be able to link a row in this table to any rows in other
    tables, particularly `event`. The exact time isn't important here so we
    can afford to be much more paranoid.

  optout_last_updated_time BIGINT(20) NOT NULL
Subscriber Table
  A table of encrypted email addresses of people who are subscribing or
  subscribed to atywit.com's newsletter. See https://aytwit.com/subscribe

  The best this table can do from a pure privacy perspective is to "pepper"
  any sensitive columns, so that if *just* the database is leaked, then the
  hackers can't get plaintext email addresses without also getting the
  application server's encryption key.

CREATE TABLE subscriber (
    Same basic treatment as `optout.optout_token__ashes_encrypted__peppered`.

  subscription_token__peppered BINARY(48) NOT NULL PRIMARY KEY,
    Same basic treatment as `optout.optout_ashes__peppered`.

  subscriber_ashes__peppered BINARY(80) NOT NULL UNIQUE,
    Same basic crypto/privacy treatment as `thought.thashes__peppered`.
    This column stores an individual's email address peppered as well as the
    value for `event.event_id`.
    This encrypted column is unique in all of Aytwit because technically the
    Aytwit application can get the plaintext of this column. It *has* to be
    able to in order to know who to send an email to! :)
    All other encrypted columns in this database require information that only
    a Patron either knows or has in order to unlock them.

  subscriber_info__peppered VARBINARY(272) NOT NULL,
    What it says on the box.

  subscription_active BOOLEAN NOT NULL,
    We want to limit the number of times a person can subscribe in case
    (for example) they're spamming the wrong email address with address
    confirmation emails. Even if it's their email address and the emails
    are just not coming through, well it's not like more emails are
    going to succeed.

  subscribe_attempt_count TINYINT NOT NULL,
    Same idea as `optout.optout_last_updated`.

  subscriber_last_updated_time BIGINT(20) NOT NULL
Donor Table
  A table of encrypted email addresses and credit card charge tokens of
  gracious people who have chosen to donate to Aytwit on a monthly or yearly
  basis. One-time donations don't have entries here. The donation event is
  recorded in the `event` table but otherwise there's no need to store
  one-time donors.
  See the `subscriber` table for a discussion of the similar privacy
  protections used here.

    Same basic treatment as `subscriber.subscription_token__peppered`.

  donor_token__peppered BINARY(48) NOT NULL PRIMARY KEY,
    Same basic treatment as `subscriber.subscriber_hash__peppered`.

  donor_ashes__peppered BINARY(80) NOT NULL UNIQUE,
    Same basic treatment as `subscriber.subscriber_info__peppered`.

  donor_info__peppered VARBINARY(272) NOT NULL,
    What it says on the box.

  donation_frequency ENUM (
    Same basic treatment as `subscriber.subscribe_attempt_count`.

  donate_attempt_count TINYINT NOT NULL,
    Same idea as `subscriber.subscription_last_updated_time`.

  donor_last_updated_time BIGINT(20) NOT NULL