ML embeddings

This tutorial shows how to embed light curves with pretrained ONNX models from the light_curve.embed submodule.

Models are distributed as ONNX files; from_hf() downloads them from the Hugging Face Hub.

Requires: pip install onnxruntime (and optionally huggingface_hub for automatic downloads)
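
Before loading a model, it can help to confirm which of the two packages are importable. The following is a minimal sketch using only the standard library; the helper name `check_embed_dependencies` is illustrative and not part of light_curve.

```python
from importlib.util import find_spec


def check_embed_dependencies():
    """Report whether the packages named in the requirements are importable."""
    # Top-level module names as installed by pip; find_spec returns None when absent.
    modules = ("onnxruntime", "huggingface_hub")
    return {name: find_spec(name) is not None for name in modules}


status = check_embed_dependencies()
for name, available in status.items():
    print(f"{name}: {'found' if available else 'missing'}")
```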

Astromer2 — single-band embeddings

Astromer2 is pretrained on MACHO light curves and produces 256-dimensional embeddings from irregularly-sampled (time, mag) pairs.

import numpy as np
from light_curve.embed import Astromer2

model = Astromer2.from_hf(output="mean")
print(f"Model loaded. Max sequence length: {model.seq_size}")

rng = np.random.default_rng(0)
time = np.sort(rng.uniform(0, 500, 120)).astype(np.float64)
mag  = rng.normal(15, 0.5, 120).astype(np.float64)

embedding = model(time, mag)
print(f"Output shape: {embedding.shape}  # (n_bands, n_subsamples, seq_windows, embed_dim)")
print(f"Squeezed:     {embedding.squeeze().shape}")

Model loaded. Max sequence length: 200
Output shape: (1, 1, 1, 256)  # (n_bands, n_subsamples, seq_windows, embed_dim)
Squeezed:     (256,)
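
The axis layout can be explored without loading a model. The sketch below uses a zero-filled stand-in array with the shape printed above; it only illustrates NumPy indexing and says nothing about the values Astromer2 produces.

```python
import numpy as np

# Stand-in with the same layout as the output above:
# (n_bands, n_subsamples, seq_windows, embed_dim)
stand_in = np.zeros((1, 1, 1, 256))

# squeeze() drops every length-1 axis, which is convenient but implicit.
flat = stand_in.squeeze()

# Explicit indexing selects band 0, subsample 0, window 0 and returns a
# 1-D vector even when those axes are longer than 1.
explicit = stand_in[0, 0, 0]

print(flat.shape, explicit.shape)
```

Explicit indexing is the safer choice in code that may later receive multi-band output, where `.squeeze()` would leave more than one axis.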

Astromer2 — multi-band

Pass bands=[...] to embed each band independently. The model returns one embedding per band:

model_gr = Astromer2.from_hf(output="mean", bands=["g", "r"])

rng2 = np.random.default_rng(1)
n = 120
time_gr = np.sort(rng2.uniform(0, 400, n)).astype(np.float64)
mag_gr  = rng2.normal(15, 0.4, n).astype(np.float64)
band_gr = np.array(["g", "r"] * (n // 2))

emb_gr = model_gr(time_gr, mag_gr, band=band_gr)
print(f"Output shape: {emb_gr.shape}  # (2 bands, n_subsamples, seq_windows, embed_dim)")

Output shape: (2, 1, 1, 256)  # (2 bands, n_subsamples, seq_windows, embed_dim)
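
Conceptually, embedding each band independently amounts to splitting the observations by band label first. The sketch below shows that split with NumPy boolean masks on the same kind of synthetic data. It illustrates the idea only and is not a description of how Astromer2 handles bands internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120
time = np.sort(rng.uniform(0, 400, n))
mag = rng.normal(15, 0.4, n)
band = np.array(["g", "r"] * (n // 2))  # alternating g, r labels

# One (time, mag) pair per band, selected with a boolean mask.
per_band = {
    name: (time[band == name], mag[band == name])
    for name in ("g", "r")
}

for name, (t, m) in per_band.items():
    print(f"{name}: {t.size} observations")
```

Because the mask preserves order, each per-band time array stays sorted.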

ATCAT — 6-band LSST model

ATCAT processes all ugrizY bands jointly and returns 384-dimensional embeddings. Inputs are flux, flux error, time, and integer band index (u=0, g=1, r=2, i=3, z=4, Y=5).

from light_curve.embed import ATCAT

model_atcat = ATCAT.from_hf(output="last")
print(f"ATCAT loaded. Max sequence length: {model_atcat.seq_size}")

rng3 = np.random.default_rng(2)
n3 = 150
time3     = np.sort(rng3.uniform(0, 500, n3)).astype(np.float32)
flux3     = rng3.normal(100, 15, n3).astype(np.float32)  # flux in nJy
flux_err3 = np.full(n3, 5.0, dtype=np.float32)
band3     = np.array([i % 6 for i in range(n3)])  # u=0, g=1, r=2, i=3, z=4, Y=5

emb3 = model_atcat(time3, flux3, flux_err3, band3)
print(f"Output shape: {emb3.shape}  # (1, 1, 1, {emb3.shape[-1]})")

ATCAT loaded. Max sequence length: 243
Output shape: (1, 1, 1, 384)  # (1, 1, 1, 384)
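
Survey tables usually store the passband as a letter rather than an integer. Here is a minimal sketch of the conversion, using the index convention stated above; the names `BAND_TO_INDEX` and `encode_bands` are hypothetical helpers, not part of light_curve.

```python
import numpy as np

# Index convention from the text above: u=0, g=1, r=2, i=3, z=4, Y=5
BAND_TO_INDEX = {"u": 0, "g": 1, "r": 2, "i": 3, "z": 4, "Y": 5}


def encode_bands(labels):
    """Map passband letters to integer indices, rejecting unknown labels."""
    unknown = sorted(set(labels) - set(BAND_TO_INDEX))
    if unknown:
        raise ValueError(f"Unknown band labels: {unknown}")
    return np.array([BAND_TO_INDEX[b] for b in labels])


labels = ["g", "r", "i", "g", "Y", "u"]
print(encode_bands(labels))
```

Raising on unknown labels catches case mismatches such as `"y"` versus `"Y"` before they reach the model.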

Batch embedding

To embed many light curves, call the model in a loop. The resulting vectors can be stacked into an (n_objects, embed_dim) matrix for downstream tasks such as classification or similarity search:

# Embed 10 synthetic light curves
rng4 = np.random.default_rng(3)
light_curves = [
    (np.sort(rng4.uniform(0, 300, 80)).astype(np.float64),
     rng4.normal(15, 0.3, 80).astype(np.float64))
    for _ in range(10)
]

embeddings = np.vstack([
    model(t, m).squeeze()[np.newaxis, :]
    for t, m in light_curves
])
print(f"Embeddings matrix shape: {embeddings.shape}  # (10 objects, 256 dims)")
print("Ready for sklearn, faiss, or any vector search library.")

Embeddings matrix shape: (10, 256)  # (10 objects, 256 dims)
Ready for sklearn, faiss, or any vector search library.
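
As one example of a downstream task, nearest-neighbour search by cosine similarity needs only NumPy. The sketch below uses random vectors as a stand-in for the `(10, 256)` matrix above, so the neighbours it finds are meaningless. With real embeddings, the same code would rank objects by similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
stand_in = rng.normal(size=(10, 256))  # random stand-in for real embeddings

# Normalise rows to unit length so that a dot product equals cosine similarity.
unit = stand_in / np.linalg.norm(stand_in, axis=1, keepdims=True)
similarity = unit @ unit.T  # shape (10, 10)

# For each object, find its most similar neighbour other than itself.
np.fill_diagonal(similarity, -np.inf)
nearest = similarity.argmax(axis=1)
print(nearest.shape)
```

For catalogues much larger than a few thousand objects, the full similarity matrix becomes expensive, which is where a dedicated vector search library is the better fit.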

Notes

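Embeddings have shape (n_bands, n_subsamples, seq_windows, embed_dim). When every leading axis has length 1, as in the single-band example, .squeeze() returns a flat vector; with two bands it returns a (2, embed_dim) array instead.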
For GPU inference, pass ort_session_kwargs={"providers": ["CUDAExecutionProvider"]} to from_hf().
huggingface_hub is only needed for automatic downloads via from_hf(). If you already have the ONNX file, it is not required.
API reference
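
For the GPU note above, it can be useful to check which ONNX Runtime execution providers are actually available before requesting CUDA. The sketch below is one way to build the provider list; `choose_providers` is a hypothetical helper, and the fallback to a CPU-only list when onnxruntime is missing is an assumption made so that the example runs anywhere.

```python
def choose_providers(available):
    """Keep CUDA first when present, followed by the CPU provider."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]


try:
    import onnxruntime as ort

    available = ort.get_available_providers()
except ImportError:
    # onnxruntime is not installed; assume CPU only for the sake of the example.
    available = ["CPUExecutionProvider"]

providers = choose_providers(available)
print(providers)
```

The resulting list could then be passed as ort_session_kwargs={"providers": providers} to from_hf().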