Posts by Tags

AutoML

Automated Data Labeling

The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators

6 minute read

Authors: Tzu-Heng Huang

Large pretrained models like GPT-4, Gemini, and Claude 3 are fantastic at labeling data, whether it’s spam detection in YouTube comments or classifying topics in medical documents. But there’s a drawback: querying these models for every single data point via API calls gets expensive fast.
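
To make the cost concern concrete, here is a hedged back-of-envelope sketch contrasting per-example API labeling with asking the model once to write a labeling program, which is the direction the post’s title points to. Every number, price, and name below is an illustrative assumption, not a figure from the post.

```python
# Back-of-envelope comparison of two labeling strategies. Every number
# below (dataset size, token counts, price) is an illustrative assumption.

N_EXAMPLES = 1_000_000           # data points to label (assumed)
TOKENS_PER_CALL = 500            # prompt + completion per example (assumed)
PRICE_PER_1K_TOKENS = 0.01       # USD per 1K tokens, assumed blended price

# Strategy 1: one API call per data point.
per_example_cost = N_EXAMPLES * (TOKENS_PER_CALL / 1000) * PRICE_PER_1K_TOKENS

# Strategy 2: a handful of calls that ask the model to *write* a small
# labeling program (e.g., a function flagging spammy YouTube comments),
# then run that program locally over all examples for roughly free.
PROGRAM_CALLS = 10               # prompts plus retries (assumed)
TOKENS_PER_PROGRAM_CALL = 2_000  # longer prompts with examples (assumed)
program_cost = PROGRAM_CALLS * (TOKENS_PER_PROGRAM_CALL / 1000) * PRICE_PER_1K_TOKENS

print(f"per-example labeling: ${per_example_cost:,.2f}")        # $5,000.00
print(f"program generation:   ${program_cost:,.2f}")            # $0.20
print(f"cost ratio: {per_example_cost / program_cost:,.0f}x")   # 25,000x here
# The ratio depends entirely on these assumed prices and token counts;
# the post itself reports savings on the order of 500x.
```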

Data Source Selection

Data-Centric AI

Tabby: Tabular Data Synthesis With Large Language Models

6 minute read

Authors: Sonia Cromp

While impressive examples of AI-generated art and dialogue have captured the public’s attention in recent years, one of the most fundamental data formats, tabular data, still lacks specialized, high-performing models. Tables are ubiquitous in modern life, but they are not modeled well by off-the-shelf models intended for other data types. Given the central role of tabular data in everything from global economic forecasts and astronomical observations to classroom gradebooks and household budgets, the lack of deep learning methods tailored for tables is quite surprising. To address this table synthesis gap, we introduce Tabby: a foundation model designed specifically for tabular data. Tabby introduces the inductive biases necessary to represent tabular data into a pre-trained large language model, avoiding the costly process of training a foundation model from scratch. Read on to discover how Tabby generates synthetic data that is nearly indistinguishable from real-world datasets!
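
As rough background for how a language model can be pointed at tables at all, here is a minimal sketch of the row-to-text serialization that LLM-based table synthesis generally builds on. The schema, format, and helper names are invented for illustration; this is the generic recipe, not Tabby’s actual architecture, which the full post describes.

```python
# Minimal sketch of the row <-> text serialization that LLM-based table
# synthesis builds on: each row becomes a short "column is value" string
# that a model can be fine-tuned on and sampled from. The columns below
# are illustrative assumptions, not a real dataset.

import re

COLUMNS = ["age", "occupation", "income"]  # assumed schema

def row_to_text(row: dict) -> str:
    """Serialize one table row into a flat string for a language model."""
    return ", ".join(f"{col} is {row[col]}" for col in COLUMNS)

def text_to_row(text: str) -> dict:
    """Parse a generated string back into a validated table row."""
    row = dict(re.findall(r"(\w+) is ([^,]+)", text))
    assert set(row) == set(COLUMNS), "model produced a malformed row"
    return row

sample = {"age": 37, "occupation": "teacher", "income": 52000}
encoded = row_to_text(sample)
print(encoded)               # age is 37, occupation is teacher, income is 52000
print(text_to_row(encoded))  # round-trips back to a dict of column values
```

Serialization alone treats a row as ordinary prose; the excerpt’s point is that Tabby instead builds table-specific inductive biases into the pre-trained model itself.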

The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators

Diverse Tasks

Foundation Models

Tabby: Tabular Data Synthesis With Large Language Models

Inference-time steering

LLM-as-data-annotators

The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators

Label Distribution Adaptation

Language Models

Multi-modal Models

Non-Euclidean ML

Optimal Transport

Overlap Density

Research

Robust ML

Self-Alignment

Structured Data

Tabby: Tabular Data Synthesis With Large Language Models

Structured Prediction

Tabular Data

Tabby: Tabular Data Synthesis With Large Language Models

Tensor Decomposition

Weak Supervision

Weak-to-Strong Generalization

Zero-Shot Models

Zero-shot inference