ML embeddings
This tutorial shows how to embed light curves with pretrained ONNX models
from the light_curve.embed submodule.
Models are distributed as ONNX files and downloaded from HuggingFace Hub by from_hf().
Requires: pip install onnxruntime (and optionally huggingface_hub for automatic downloads)
import numpy as np
from light_curve.embed import Astromer2
model = Astromer2.from_hf(output="mean")
print(f"Model loaded. Max sequence length: {model.seq_size}")
rng = np.random.default_rng(0)
time = np.sort(rng.uniform(0, 500, 120)).astype(np.float64)
mag = rng.normal(15, 0.5, 120).astype(np.float64)
embedding = model(time, mag)
print(f"Output shape: {embedding.shape} # (n_bands, n_subsamples, seq_windows, embed_dim)")
print(f"Squeezed: {embedding.squeeze().shape}")
Model loaded. Max sequence length: 200
Output shape: (1, 1, 1, 256) # (n_bands, n_subsamples, seq_windows, embed_dim)
Squeezed: (256,)
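The four output axes can also be handled explicitly instead of with a blanket squeeze(). The sketch below uses a zero-filled NumPy array as a stand-in for a real model output (the name fake_embedding is an assumption for illustration, not something the library produces). It shows that indexing the singleton axes yields the same flat vector, while failing loudly if one of those axes is unexpectedly longer than 1.

```python
import numpy as np

# Stand-in for a model output with shape
# (n_bands, n_subsamples, seq_windows, embed_dim); values are placeholders.
fake_embedding = np.zeros((1, 1, 1, 256), dtype=np.float32)

# squeeze() drops every axis of length 1, whichever axis that is.
flat_squeezed = fake_embedding.squeeze()

# Explicit indexing documents which axes are expected to be singletons.
n_bands, n_subsamples, seq_windows, embed_dim = fake_embedding.shape
assert (n_bands, n_subsamples, seq_windows) == (1, 1, 1)
flat_indexed = fake_embedding[0, 0, 0]

print(flat_squeezed.shape, flat_indexed.shape)
```

Explicit indexing is the safer choice in pipelines where the number of bands or windows may vary, because squeeze() would silently return an array of a different rank in that case.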
Astromer2: multi-band
Pass bands=[...] to embed each band independently.
The model returns one embedding per band:
model_gr = Astromer2.from_hf(output="mean", bands=["g", "r"])
rng2 = np.random.default_rng(1)
n = 120
time_gr = np.sort(rng2.uniform(0, 400, n)).astype(np.float64)
mag_gr = rng2.normal(15, 0.4, n).astype(np.float64)
band_gr = np.array(["g", "r"] * (n // 2))
emb_gr = model_gr(time_gr, mag_gr, band=band_gr)
print(f"Output shape: {emb_gr.shape} # (2 bands, n_subsamples, seq_windows, embed_dim)")
Output shape: (2, 1, 1, 256) # (2 bands, n_subsamples, seq_windows, embed_dim)
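A multi-band output usually needs to be split by band or merged into one feature vector before downstream use. The following is a minimal sketch on a stand-in array (fake_emb_gr is a placeholder, not real model output). It assumes that the first axis follows the order of the list passed to bands=[...], which should be verified against the API reference.

```python
import numpy as np

bands = ["g", "r"]  # assumed to match the order given to bands=[...]

# Stand-in for emb_gr with shape (n_bands, n_subsamples, seq_windows, embed_dim).
fake_emb_gr = np.arange(2 * 256, dtype=np.float32).reshape(2, 1, 1, 256)

# One flat vector per band, keyed by band name.
per_band = {name: fake_emb_gr[i, 0, 0] for i, name in enumerate(bands)}

# A single feature vector per object: concatenate the bands in a fixed order.
combined = np.concatenate([per_band[name] for name in bands])

print({name: vec.shape for name, vec in per_band.items()})
print(combined.shape)
```

Concatenation keeps the per-band information separate within the feature vector; averaging across bands would be an alternative when a fixed 256-dimensional vector is required.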
from light_curve.embed import ATCAT
model_atcat = ATCAT.from_hf(output="last")
print(f"ATCAT loaded. Max sequence length: {model_atcat.seq_size}")
rng3 = np.random.default_rng(2)
n3 = 150
time3 = np.sort(rng3.uniform(0, 500, n3)).astype(np.float32)
flux3 = rng3.normal(100, 15, n3).astype(np.float32) # flux in nJy
flux_err3 = np.full(n3, 5.0, dtype=np.float32)
band3 = np.array([i % 6 for i in range(n3)]) # u=0, g=1, r=2, i=3, z=4, Y=5
emb3 = model_atcat(time3, flux3, flux_err3, band3)
print(f"Output shape: {emb3.shape} # (1, 1, 1, {emb3.shape[-1]})")
ATCAT loaded. Max sequence length: 243
Output shape: (1, 1, 1, 384) # (1, 1, 1, 384)
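The ATCAT example passes integer band codes, with the mapping given in its code comment (u=0, g=1, r=2, i=3, z=4, Y=5). Catalogues often store band names as strings, so a conversion step may be needed first. The sketch below shows one way to do it; BAND_CODES and encode_bands are hypothetical names introduced here and are not part of light_curve.

```python
import numpy as np

# Mapping taken from the comment in the ATCAT example above.
BAND_CODES = {"u": 0, "g": 1, "r": 2, "i": 3, "z": 4, "Y": 5}


def encode_bands(labels):
    """Convert band names to an integer array (hypothetical helper)."""
    try:
        return np.array([BAND_CODES[label] for label in labels], dtype=int)
    except KeyError as err:
        # Fail early on labels the mapping does not cover.
        raise ValueError(f"Unknown band label: {err.args[0]!r}") from None


band_labels = ["g", "r", "i", "g", "z", "Y", "u"]
band_codes = encode_bands(band_labels)
print(band_codes)
```

Raising on unknown labels avoids silently feeding the model a band index it was never trained on.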
Batch embedding
To embed many light curves, call the model in a loop and stack the resulting vectors into a matrix with one row per object, ready for downstream tasks such as classification or similarity search:
# Embed 10 synthetic light curves
rng4 = np.random.default_rng(3)
light_curves = [
    (
        np.sort(rng4.uniform(0, 300, 80)).astype(np.float64),
        rng4.normal(15, 0.3, 80).astype(np.float64),
    )
    for _ in range(10)
]
embeddings = np.vstack([
    model(t, m).squeeze()[np.newaxis, :]
    for t, m in light_curves
])
print(f"Embeddings matrix shape: {embeddings.shape} # (10 objects, 256 dims)")
print("Ready for sklearn, faiss, or any vector search library.")
Embeddings matrix shape: (10, 256) # (10 objects, 256 dims)
Ready for sklearn, faiss, or any vector search library.
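As an illustration of the similarity-search use case, the sketch below finds the nearest neighbour of each object by cosine similarity using NumPy only. The matrix fake_embeddings is filled with random numbers as a stand-in for the (10, 256) matrix above, so the neighbours it reports carry no physical meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the (10, 256) embeddings matrix; random values, not model output.
fake_embeddings = rng.normal(size=(10, 256))

# L2-normalise the rows so that a dot product equals cosine similarity.
norms = np.linalg.norm(fake_embeddings, axis=1, keepdims=True)
unit = fake_embeddings / norms

similarity = unit @ unit.T             # (10, 10) matrix of cosine similarities
np.fill_diagonal(similarity, -np.inf)  # exclude each object as its own neighbour
nearest = similarity.argmax(axis=1)    # index of the most similar other object

print(nearest)
```

A brute-force similarity matrix like this is adequate for a few thousand objects; for larger collections a dedicated vector index is the usual choice.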
Notes
- Embeddings have shape (n_bands, n_subsamples, seq_windows, embed_dim). Use .squeeze() to get a flat vector for a single object.
- For GPU inference, pass ort_session_kwargs={"providers": ["CUDAExecutionProvider"]} to from_hf().
- huggingface_hub is only needed for automatic downloads via from_hf(). If you already have the ONNX file, it is not required.
- API reference
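For the GPU note, it can help to check which execution providers the installed ONNX Runtime build actually offers before requesting CUDA. The sketch below builds a provider list with a CPU fallback; choose_providers is a hypothetical helper introduced here, and the only ONNX Runtime call used is onnxruntime.get_available_providers().

```python
def choose_providers(available):
    """Prefer CUDA when present and keep CPU as a fallback (hypothetical helper)."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available] or ["CPUExecutionProvider"]


try:
    import onnxruntime as ort

    available = ort.get_available_providers()
except ImportError:
    # onnxruntime is not installed; assume CPU only for this illustration.
    available = ["CPUExecutionProvider"]

ort_session_kwargs = {"providers": choose_providers(available)}
print(ort_session_kwargs)
```

The resulting dictionary has the same form as the ort_session_kwargs value shown in the GPU note. ONNX Runtime treats the provider list as a priority order, so listing CPU last gives a fallback when CUDA cannot be initialised.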