Basic requirements:

  • Dataset size: At least 1000 rows and 5 columns.
  • The first row should have column names.
  • At least one identifier column (e.g. Customer ID or Name).
  • Columns with values in comma-separated format are treated as one long piece of text instead of separate values. For example, “Google, Apple, Facebook” contains 3 separate values but is treated as a single value.

  • Data should be collected in a single file or table.
  • There should be as few missing/blank values as possible.
  • Personally Identifiable Information (PII) columns (e.g. phone, email, address, etc.) are not required.

    You can make a dataset more practical by replacing long free-text sentences with distinct category values.
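The basic requirements above can be checked before uploading a file. The following is a minimal sketch using only the Python standard library; the function name `check_basic_requirements` and the shape of its result are illustrative assumptions, not part of any product API. It counts blank cells rather than enforcing a limit, since the requirement is only that there be as few as possible.

```python
import csv
import io


def check_basic_requirements(csv_text, min_rows=1000, min_cols=5):
    """Summarise a CSV string against the basic size and header requirements.

    Hypothetical helper: the thresholds default to the values listed above.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return {"rows": 0, "columns": 0, "blank_cells": 0,
                "problems": ["file is empty"]}

    # The first row is assumed to hold the column names.
    header, data = rows[0], rows[1:]
    problems = []
    if len(header) < min_cols:
        problems.append(f"only {len(header)} columns, need at least {min_cols}")
    if any(not name.strip() for name in header):
        problems.append("header row contains blank column names")
    if len(data) < min_rows:
        problems.append(f"only {len(data)} rows, need at least {min_rows}")

    # Count empty or whitespace-only cells so they can be reviewed.
    blank_cells = sum(1 for row in data for cell in row if not cell.strip())

    return {
        "rows": len(data),
        "columns": len(header),
        "blank_cells": blank_cells,
        "problems": problems,
    }
```

An empty `problems` list means the size and header checks passed; `blank_cells` still needs a manual look.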

Intermediate requirements:

    The following items describe how to handle missing data so that your model trains more effectively.

      Missing Data
  • A single missing value causes the entire row to be discarded.
  • For numeric columns (price, salary, age, etc.), fill missing values with 0 or -1.
  • For text/categorical columns (gender, country, etc.), fill missing values with “Unknown”.
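The fill rules above can be sketched as follows. This is an illustrative example in plain Python, assuming rows are held as dictionaries and that a missing value appears as `None` or an empty string; the function name `fill_missing` and its parameters are hypothetical.

```python
def _is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def fill_missing(rows, numeric_cols, text_cols,
                 numeric_fill=-1, text_fill="Unknown"):
    """Return copies of the rows with missing values replaced.

    numeric_fill can be 0 or -1, as suggested above; pick a value that
    cannot be confused with real data in that column.
    """
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        for col in numeric_cols:
            if _is_missing(new_row.get(col)):
                new_row[col] = numeric_fill
        for col in text_cols:
            if _is_missing(new_row.get(col)):
                new_row[col] = text_fill
        filled.append(new_row)
    return filled
```

For a column such as age, where 0 could be mistaken for a real value, -1 is usually the safer placeholder.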
Advanced requirements:

    Technical knowledge is recommended for advanced requirements.

      Data Enrichment

    Create new columns:

    The quality of a dataset is often enhanced by deriving new columns from existing columns or by correlating different datasets.

    For example, age can be derived from date of birth, and duration from the start and end dates of a customer subscription or employment period.

    Once new columns are created, the source columns that are no longer needed should be excluded from training, since they only add redundant information.
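As an illustration of deriving columns and then dropping their sources, here is a minimal sketch using the Python standard library. The column names (`date_of_birth`, `start_date`, `end_date`, `age`, `subscription_days`) and the ISO date format are assumptions made for the example.

```python
from datetime import date

# Hypothetical source columns that become redundant once derived columns exist.
SOURCE_COLUMNS = ("date_of_birth", "start_date", "end_date")


def age_in_years(birth_date, today):
    """Whole years between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def enrich(rows, today):
    """Add 'age' and 'subscription_days', then drop the source columns."""
    enriched = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in SOURCE_COLUMNS}
        birth = date.fromisoformat(row["date_of_birth"])
        start = date.fromisoformat(row["start_date"])
        end = date.fromisoformat(row["end_date"])
        new_row["age"] = age_in_years(birth, today)
        new_row["subscription_days"] = (end - start).days
        enriched.append(new_row)
    return enriched
```

Passing `today` explicitly keeps the result reproducible; in practice it would typically be the date the dataset was exported.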

    Additional columns should be created from comma-separated values. Columns with values in comma-separated format are treated as one long piece of text instead of separate values.

    For example, “Google, Apple, Facebook” contains 3 separate values but is treated as a single value.

    Separate columns can be created for each value and filled with 0 or 1 depending on whether the value is present in a particular row.
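The 0/1 expansion described above can be sketched as follows. This is an illustrative plain-Python example; the function name `expand_multi_value`, the `employers` column, and the `column_value` naming of the new columns are assumptions.

```python
def expand_multi_value(rows, column, sep=","):
    """Replace a comma-separated column with one 0/1 column per distinct value."""

    def split_values(row):
        # Missing or empty cells yield an empty set of values.
        raw = row.get(column) or ""
        return {part.strip() for part in raw.split(sep) if part.strip()}

    # Collect every distinct value seen in the column, in a stable order.
    all_values = sorted(set().union(*(split_values(row) for row in rows)))

    expanded = []
    for row in rows:
        present = split_values(row)
        # Drop the original text column and add one indicator column per value.
        new_row = {k: v for k, v in row.items() if k != column}
        for value in all_values:
            new_row[f"{column}_{value}"] = 1 if value in present else 0
        expanded.append(new_row)
    return expanded


rows = [
    {"id": 1, "employers": "Google, Apple, Facebook"},
    {"id": 2, "employers": "Apple"},
]
expanded = expand_multi_value(rows, "employers")
# expanded[0] has employers_Apple, employers_Facebook and employers_Google set to 1;
# expanded[1] has only employers_Apple set to 1.
```

Note that a column with many distinct values produces many new columns, so this works best when the set of values is small.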