Data and Implementation Specifications

Data Requirements

General Requirements

When creating exports or queries for data to be delivered to Othot, consider the following:

  1. The data must represent all information about both the positive and negative outcomes (e.g., prospects that enrolled and prospects that did not, or students that retained and students that did not).
  2. Data that is recorded at the end of the cycle only for records that achieve the final outcome (e.g., enroll or retain), or that is highly correlated with the final outcome, will not be included in modeling because it is considered a “leak” variable.
  3. To allow for proper review, display, and use of variables:  
    • All provided data must be identified in the Data Mapping document provided in the onboarding phase (or in supplementary data dictionary documents provided by the client).
    • Any provided variables that are coded (e.g., 1=YES/0=NO, R=RESIDENT/C=COMMUTER, etc.) must be accompanied by a description or decoded version in a supplemental document.
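
The “leak” rule in item 2 can be screened for before delivery. The following is a minimal sketch, assuming records are dictionaries of strings and the outcome is a YES/NO field. The function name `find_leak_candidates` and the column names are hypothetical. It only flags columns for manual review and does not detect variables that are merely highly correlated with the outcome.

```python
def find_leak_candidates(rows, target_field, positive_value="YES"):
    """Return columns that are populated only for positive-outcome records."""
    populated_positive = set()
    populated_negative = set()
    for row in rows:
        outcome = (row.get(target_field) or "").strip().upper()
        is_positive = outcome == positive_value
        for name, value in row.items():
            # Skip the target itself and any empty cells.
            if name == target_field or not (value or "").strip():
                continue
            if is_positive:
                populated_positive.add(name)
            else:
                populated_negative.add(name)
    return sorted(populated_positive - populated_negative)


# Hypothetical example data: HOUSING_ASSIGNMENT exists only for enrolled students.
rows = [
    {"STUDENT_ID": "1", "ENROLLED": "YES", "HOUSING_ASSIGNMENT": "Hall A", "GPA": "3.4"},
    {"STUDENT_ID": "2", "ENROLLED": "NO", "HOUSING_ASSIGNMENT": "", "GPA": "3.1"},
    {"STUDENT_ID": "3", "ENROLLED": "NO", "HOUSING_ASSIGNMENT": "", "GPA": ""},
]
leaks = find_leak_candidates(rows, "ENROLLED")
print(leaks)
```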

Data Format Requirements

The following guidelines must be adhered to for all files provided to us. A lack of adherence to these guidelines may increase data configuration time and prevent the processing of files for regular updates.

Item

Guidelines

File Format

  • Files delivered to us must be in .csv format.
  • Each line must contain the same number of fields throughout the file.

Headers

The first row of the file must contain column headers. These headers cannot include special characters or line terminators (CRLF/CR/LF). Example header formats are as follows:

  • HEADER_EXAMPLE (preferred)
  • Header Example

Delimiters

  • Accepted delimiters include: 
    • Comma (,) (preferred)
    • Tab (\t)
    • Pipe (|)
  • Fields containing line terminators, double quotes, commas, or other delimiters must be enclosed in double quotes. 
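
The file format, header, and delimiter guidelines above can be checked before a file is sent. The following is a minimal sketch using Python's standard `csv` module. The function name `validate_csv_text` is hypothetical, and the header pattern (letters, digits, underscores, and spaces) is an assumption about what counts as a non-special character.

```python
import csv
import io
import re

# Assumption: only letters, digits, underscores, and spaces are allowed in headers.
HEADER_PATTERN = re.compile(r"[A-Za-z0-9_ ]+")


def validate_csv_text(text, delimiter=","):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # newline="" leaves line terminators untouched so the csv module handles them.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    for name in header:
        if not HEADER_PATTERN.fullmatch(name):
            problems.append(f"header {name!r} contains special characters")
    # Record 1 is the header, so data records are counted from 2.
    for index, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            problems.append(
                f"record {index}: expected {len(header)} fields, found {len(fields)}"
            )
    return problems


# The quoted field "Doe, Jane" is read as one field; the last record has an extra field.
sample = 'STUDENT_ID,NAME,CITY\r\n1,"Doe, Jane",Pittsburgh\r\n2,John,Erie,PA\r\n'
problems = validate_csv_text(sample)
print(problems)
```

For tab- or pipe-delimited files, pass `delimiter="\t"` or `delimiter="|"`.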

Encoding

Accepted encoding includes:

  • UTF-8 (preferred)
  • Windows-1252
  • ISO-8859-1
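
One way to check which accepted encoding a file decodes under is to try them in order of preference. This is a heuristic sketch only: the function name `decode_with_accepted_encoding` is hypothetical, and a successful decode does not prove the file was written in that encoding, so the result should be spot-checked for garbled characters.

```python
def decode_with_accepted_encoding(raw_bytes):
    """Try the accepted encodings in order; return (text, encoding_name)."""
    for encoding in ("utf-8", "cp1252"):  # cp1252 is Python's name for Windows-1252
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte to a character, so this fallback cannot fail.
    return raw_bytes.decode("iso-8859-1"), "iso-8859-1"


text, encoding = decode_with_accepted_encoding("café".encode("utf-8"))
print(encoding)
```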

Line Terminators

  • Each row must end with CRLF, CR, or LF characters to indicate the end of a line. 
  • Accepted line terminators include:  
    • \r\n
    • \n
    • \r

Date Formats

Accepted date formats include:

  • YYYY-MM-DD (preferred)
  • DD-MM-YYYY
  • DD/MM/YYYY
  • DD/MM/YYYY [TIMESTAMP]
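
Date values can be checked against the accepted formats before delivery. The following is a minimal sketch; the function name `parse_accepted_date` is hypothetical, and because the layout of [TIMESTAMP] is not defined here, `HH:MM:SS` is assumed for illustration.

```python
from datetime import datetime

ACCEPTED_FORMATS = (
    "%Y-%m-%d",           # YYYY-MM-DD (preferred)
    "%d-%m-%Y",           # DD-MM-YYYY
    "%d/%m/%Y",           # DD/MM/YYYY
    "%d/%m/%Y %H:%M:%S",  # assumption: [TIMESTAMP] is HH:MM:SS
)


def parse_accepted_date(value):
    """Return a datetime if the value matches an accepted format, else None."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None


print(parse_accepted_date("15/03/2024"))
print(parse_accepted_date("03/15/2024"))  # month-first values are rejected
```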

Required Fields

The following fields are required for all records:

Field

Description & Specifications

Unique Identifier

This field represents a unique student record. Generally, it's a student ID (numeric or alpha-numeric) that is assigned to the individual when they enter the population. This ID must not change throughout the cycle.

Admission Term or Year

This field represents the term or year in which the record is part of the population for the intended outcome. It's used internally to split data into train or predict sets, as well as to evaluate data consistency and model performance year-over-year.

Target Variable

This field represents the target or desired outcome for model training and predicting (e.g., enrollment status/retention status). It can be either a date on which the record achieved the desired outcome or a YES/NO field which indicates whether the record achieved the desired outcome.

Note: this field is generally representative of the final outcome at the institution’s “census day.”
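
Because the target can arrive as either a date or a YES/NO value, it may help to see how both forms reduce to a single binary outcome. The following is a sketch under stated assumptions: the function name `target_to_binary` is hypothetical, a blank value is treated as "did not achieve the outcome", and dates are assumed to use the preferred YYYY-MM-DD format.

```python
from datetime import datetime


def target_to_binary(value, date_format="%Y-%m-%d"):
    """Map a target value (date string or YES/NO) to 1 or 0."""
    text = (value or "").strip()
    if text.upper() in {"YES", "Y"}:
        return 1
    # Assumption: a blank target means the outcome was not achieved.
    if text == "" or text.upper() in {"NO", "N"}:
        return 0
    # Anything else must be a valid date; strptime raises ValueError otherwise.
    datetime.strptime(text, date_format)
    return 1


print(target_to_binary("2024-09-03"), target_to_binary("NO"), target_to_binary(""))
```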

Lifecycle Indicators

Date fields must be provided for each lifecycle phase or step in the process. For example, standard lifecycles for enrollment might include: 

  • Prospect Date
  • Inquired Date
  • Application Started Date
  • Application Completed Date
  • Admitted Date
  • Deposited Date

Note: standard lifecycles for retention are dependent on the High Impact Question (HIQ) selected.

Terminal Node Indicators

Date fields must be provided for each terminal node in the process (i.e., the points at which the individual exits the process). For example, terminal nodes for enrollment might include: 

  • Denied Date
  • Withdrawn Application Date
  • Cancelled Deposit Date
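
To illustrate how lifecycle and terminal node date fields describe a record's path, the sketch below reports the furthest lifecycle phase reached and any terminal nodes recorded. The column names and the function name `summarize_record` are hypothetical and modeled on the enrollment examples above; actual field names depend on your data.

```python
# Hypothetical column names, listed in lifecycle order.
LIFECYCLE_FIELDS = [
    "PROSPECT_DATE",
    "INQUIRED_DATE",
    "APPLICATION_STARTED_DATE",
    "APPLICATION_COMPLETED_DATE",
    "ADMITTED_DATE",
    "DEPOSITED_DATE",
]
TERMINAL_FIELDS = [
    "DENIED_DATE",
    "WITHDRAWN_APPLICATION_DATE",
    "CANCELLED_DEPOSIT_DATE",
]


def summarize_record(record):
    """Return the furthest populated lifecycle phase and any populated terminal nodes."""
    furthest = None
    for field in LIFECYCLE_FIELDS:
        if (record.get(field) or "").strip():
            furthest = field
    exits = [f for f in TERMINAL_FIELDS if (record.get(f) or "").strip()]
    return {"furthest_phase": furthest, "exited_via": exits}


record = {
    "PROSPECT_DATE": "2023-09-01",
    "INQUIRED_DATE": "2023-10-01",
    "APPLICATION_STARTED_DATE": "2023-11-10",
    "WITHDRAWN_APPLICATION_DATE": "2023-12-01",
}
summary = summarize_record(record)
print(summary)
```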

Data Requirements for Platform Functionality

The following requirements are necessary to enable specific functionality: 

Functionality

Requirements

Predictions by Specific Populations

To view predictions broken down by a specific population, that population must be identified by a field in the data. It's also important that you inform us about any groups of records that move through the process differently than others, or any populations that are treated differently during the process. For example:

  • Transfer vs. first-time. 
  • Early action vs. regular decision. 
  • In-state vs. out-of-state. 

Additional Predictions

Common use cases for additional predictions include:

  • Likelihood to transition from one phase to another.
  • Likelihood to be retained. 
  • Likelihood to visit a specific event. 

Note that to enable additional predictions, you must provide data for each record that indicates whether it meets the criteria. For example, the “likelihood to be retained” prediction requires an additional field that indicates whether an enrolled student was retained. Furthermore, additional predictions only provide you with the likelihood score. To perform What-If analyses or see top impacts for the additional prediction, you must create a new HIQ.

What-If Variables

What-If variables represent variables that can be “controlled” or “influenced” by your institution. For a variable to be used in a What-If analysis, it cannot be a "leak" variable and must be:

  • Independent of other variables in the model.
  • Fairly consistent year-over-year (for model training purposes).
  • Selected by the model (i.e., of significant importance in modeling the desired outcome).
  • Enabled for the relevant stage (e.g., financial aid is only available for the Admit phase or later).

Best Practices for Data Collection

When providing data to us, consider these best practices: 

Category

Best Practices

Multiple File Merge

  • There should be a single “base” file which lists all records, to which additional files can be joined. This file is generally in a one-to-one format (i.e., flat).
  • Additional files can be either one-to-one (i.e., flat) or one-to-many (i.e., transactional).
  • Each file should contain one or more unique identifier fields that can be used across all files to join back to the base file. These fields are generally represented by a student ID and an academic term or year (if applicable).
  • If you're providing transactional files, include a date or timestamp field (if available) for each row.
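
The relationship between a flat base file and a transactional file can be sketched as follows: the transactional rows are first summarized to one value per key, then joined back to the base records. This is an illustration only, not a description of how files are actually processed. The function name `join_event_counts` and the column names are hypothetical.

```python
from collections import Counter


def join_event_counts(base_rows, event_rows, key_fields=("STUDENT_ID", "TERM")):
    """Add an EVENT_COUNT column to each base row from a one-to-many event file."""
    # Summarize the transactional file: number of rows per join key.
    counts = Counter(tuple(row[k] for k in key_fields) for row in event_rows)
    joined = []
    for row in base_rows:
        key = tuple(row[k] for k in key_fields)
        # Base records with no matching transactions get a count of 0.
        joined.append({**row, "EVENT_COUNT": counts.get(key, 0)})
    return joined


base = [
    {"STUDENT_ID": "1", "TERM": "2024FA"},
    {"STUDENT_ID": "2", "TERM": "2024FA"},
]
events = [
    {"STUDENT_ID": "1", "TERM": "2024FA", "EVENT_DATE": "2023-10-14"},
    {"STUDENT_ID": "1", "TERM": "2024FA", "EVENT_DATE": "2023-11-02"},
]
joined = join_event_counts(base, events)
print(joined)
```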

Visit and Event Data

  • This data is best provided in a transactional format, with one row per record per event.
  • A date should be provided to indicate when the individual attended the visit or event. This enables additional variables to be derived about the lifecycle phase in which the visit or event occurred.
  • If the names of visits or events are not consistent, it's best to categorize them into groups that represent similar types. Generally, this is done by providing an additional column in the transactional file that represents the visit type. For example: 
    • Campus Visits
    • Counselor Meetings
    • On-Campus Events
    • Off-Campus Events
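
The reason a visit or event date matters can be shown with a small sketch that places an event within a record's lifecycle. The function name `phase_at_event` and the sample phases are hypothetical; the actual derived variables may differ.

```python
from datetime import date


def phase_at_event(lifecycle_dates, event_date):
    """Return the most recent phase reached on or before event_date, or None.

    lifecycle_dates is a list of (phase_name, date_or_None) in lifecycle order.
    """
    current_name = None
    current_date = None
    for name, reached_on in lifecycle_dates:
        if reached_on is None or reached_on > event_date:
            continue
        # ">=" lets a later-listed phase win when two phases share a date.
        if current_date is None or reached_on >= current_date:
            current_name, current_date = name, reached_on
    return current_name


lifecycle = [
    ("Prospect", date(2023, 9, 1)),
    ("Inquired", date(2023, 10, 1)),
    ("Admitted", date(2024, 1, 15)),
]
phase = phase_at_event(lifecycle, date(2023, 11, 5))
print(phase)
```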

Interaction Data

  • This data is best provided in a transactional format, with one row per record per interaction.
  • A date should be provided to indicate when the interaction occurred. This enables additional variables to be derived about the lifecycle phase in which the interaction occurred.
  • If the names of interactions are not consistent, it's best to categorize them into groups that represent similar types. Generally, this is done by providing an additional column in the transactional file that represents the interaction type. For example:
    • Phone Call
    • Email
    • SMS
    • Mailing
  • It's also important to distinguish between interactions initiated by the individual and interactions initiated by the institution (e.g., Incoming vs. Outgoing). This can be done as part of the interaction type column or as an additional column in the data.

Financial Aid Data

  • This data is best provided in a transactional format, with one row per record per aid year per award.
  • A specific "tuition" or "tuition + mandatory fees" variable should be included for each student so we can derive and display a calculation for Net Tuition Revenue (NTR = Tuition – Institutional Aid).
  • Financial aid data should include a field or fields that indicate the type of aid (e.g., Grant, Scholarship, Loan, Work Study, etc.) and the source of aid (e.g., Federal, State, Institutional, Outside, etc.).
  • Aid files should contain the offer amount for all students who were offered awards, including both individuals who achieved the desired target and those who did not.
  • To ensure aid is modeled correctly and predictions are useful, it's beneficial to inform us about any specific aid packaging processes and how they may be expressed in the data. This includes, but is not limited to:
    • Awards that have specific criteria.
    • Awards offered to only specific populations.
    • The timing of award packages.
    • Policies related to meeting expected need.
    • Processes for aid appeals.
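
The NTR formula above (tuition minus institutional aid) can be sketched against transactional award rows as follows. The function name `net_tuition_revenue`, the column names, and the amounts are hypothetical; the sketch assumes the aid source is recorded in a SOURCE column.

```python
def net_tuition_revenue(tuition, awards):
    """Return tuition minus the sum of awards whose source is institutional."""
    institutional_aid = sum(
        award["AMOUNT"]
        for award in awards
        if award["SOURCE"].strip().upper() == "INSTITUTIONAL"
    )
    return tuition - institutional_aid


# One row per record per aid year per award (hypothetical amounts).
awards = [
    {"TYPE": "Scholarship", "SOURCE": "Institutional", "AMOUNT": 12000},
    {"TYPE": "Grant", "SOURCE": "Federal", "AMOUNT": 5000},
    {"TYPE": "Grant", "SOURCE": "Institutional", "AMOUNT": 3000},
]
ntr = net_tuition_revenue(30000, awards)
print(ntr)  # 30000 - (12000 + 3000) = 15000
```

Only institutional awards reduce NTR in this sketch; federal, state, and outside aid are left out of the subtraction, which is why the source of aid needs to be identifiable in the data.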

Test Score Data

  • As a standard practice, we convert all ACT scores to SAT scale and take the maximum value between the two to derive the “Highest Test Score” achieved by the individual.
  • Data related to test scores can be provided in a number of different ways, but ultimately we need either single fields or the logic to derive single fields that represent both the individual’s ACT and SAT scores. 
  • To provide the best user experience, you should inform us about the preference for ACT vs. SAT so scores can be converted and displayed on your desired scale.
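
The “Highest Test Score” derivation described above can be sketched as follows, assuming the ACT-to-SAT conversion is a lookup table. The function name `highest_test_score` is hypothetical, and the concordance values below are illustrative placeholders only; they are not presented as the table actually used, so substitute a published concordance.

```python
def highest_test_score(sat_score, act_score, act_to_sat):
    """Convert ACT to the SAT scale and return the higher of the two scores."""
    candidates = []
    if sat_score is not None:
        candidates.append(sat_score)
    if act_score is not None:
        # Raises KeyError if the ACT score is missing from the concordance table.
        candidates.append(act_to_sat[act_score])
    return max(candidates) if candidates else None


# PLACEHOLDER concordance for illustration only; not an official table.
EXAMPLE_CONCORDANCE = {20: 1040, 25: 1210, 30: 1370}

best = highest_test_score(1150, 25, EXAMPLE_CONCORDANCE)
print(best)
```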