Similarity search
Embeddings are dense fixed-length vectors, so cosine distance is a natural proxy for light-curve
similarity. The example below reads one ZTF DR23 HATS pixel directly from the public S3 bucket
using nested-pandas, with universal-pathlib and s3fs providing the S3 access
(install with pip install nested-pandas universal-pathlib s3fs), embeds all well-observed light curves with
Astromer2, and finds the closest neighbour to a given object by cosine distance.
This pixel contains ~50k objects passing the quality cuts; embedding takes roughly 12 minutes on an Apple M2 Pro (~15 ms per object).
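As a quick illustration of the metric itself (separate from the pipeline below): the cosine distance between two vectors u and v is 1 − u·v / (‖u‖‖v‖), so it depends only on the angle between them. The sketch uses made-up three-component vectors and a hypothetical helper, `cosine_distance`; real Astromer2 embeddings are 256-dimensional.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v). Hypothetical helper for illustration."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                     # same direction as a, ten times the length
c = np.array([3.0, -1.0, 0.5])   # a different direction

d_ab = cosine_distance(a, b)     # effectively zero: rescaling changes nothing
d_ac = cosine_distance(a, c)     # clearly non-zero
print(f"d(a, b) = {d_ab:.2e}, d(a, c) = {d_ac:.3f}")
```

Two embeddings that point the same way are therefore treated as identical regardless of their norms, which is the property the nearest-neighbour search relies on.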
# %pip install light-curve huggingface_hub onnxruntime nested-pandas universal-pathlib s3fs scipy
import nested_pandas as npd
import numpy as np
from scipy.spatial.distance import cdist
from upath import UPath
from light_curve.embed import Astromer2
TARGET_OID = 680213300009232 # ZTF r-band light curve
MIN_OBS = 1000
Step 1 — Load data
Read one HATS pixel of ZTF DR23 directly from the public S3 bucket and keep only
objects with clean photometry (catflags == 0) and at least 1 000 observations.
nf = npd.read_parquet(
    UPath(
        "s3://ipac-irsa-ztf/contributed/dr23/lc/hats/ztf_dr23_lc-hats"
        "/dataset/Norder=5/Dir=0/Npix=2378/",
        anon=True,
    )
)
nf = nf.query("lightcurve.catflags == 0").query(f"lightcurve.list_lengths >= {MIN_OBS}")
print(f"Objects after quality cuts: {len(nf):,}")
Objects after quality cuts: 49,166
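The order of the two cuts matters. The toy sketch below assumes the chained query first drops flagged observations and then counts what remains per object. It uses plain numpy arrays with made-up object IDs and flags and a small threshold in place of the nested frame, so it only illustrates the logic, not the nested-pandas API.

```python
import numpy as np

MIN_OBS = 3  # toy threshold; the tutorial uses 1000

# One entry per observation (made-up values): owning object and its quality flag.
objectid = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
catflags = np.array([0, 0, 0, 32768, 0, 0, 1, 0, 0, 0, 0])

# Cut 1: keep only observations with clean photometry.
clean = catflags == 0

# Cut 2: keep objects that still have at least MIN_OBS clean observations.
ids, counts = np.unique(objectid[clean], return_counts=True)
kept_ids = ids[counts >= MIN_OBS].tolist()
print(kept_ids)
```

Object 2 has three raw observations but only two clean ones, so it is dropped even though its raw count meets the threshold.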
Step 2 — Embed with Astromer2
Load the pretrained model and run it over all light curves with map_rows.
Each call returns a 256-dim vector; the results are stored as a nested column
embedding.value and then stacked into a matrix for distance computation.
model = Astromer2.from_hf(output="mean", reduction="beginning")


def embed_row(hmjd, mag):
    return {"embedding.value": model(hmjd, mag).squeeze()}


nf = nf.map_rows(
    embed_row,
    columns=["lightcurve.hmjd", "lightcurve.mag"],
    row_container="args",
    append_columns=True,
)
print(f"Embedded {len(nf):,} objects.")
Embedded 49,166 objects.
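For planning runs on other pixels, the timing quoted in the introduction can be reproduced with simple arithmetic. The sketch assumes the cost scales linearly with the number of objects and reuses the approximate 15 ms per-object figure, which will differ on other hardware.

```python
N_OBJECTS = 49_166     # objects passing the quality cuts in this pixel
MS_PER_OBJECT = 15     # approximate per-object cost quoted in the introduction

total_seconds = N_OBJECTS * MS_PER_OBJECT / 1000
total_minutes = total_seconds / 60
print(f"Estimated embedding time: {total_minutes:.1f} min")
```

This comes to a little over 12 minutes, consistent with the figure given for this pixel.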
Step 3 — Find nearest neighbour
Stack the embeddings into a matrix and compute cosine distances from the query object to all others. Cosine distance is scale-invariant: only the direction of the embedding vector matters, not its magnitude.
oids = nf["objectid"].to_numpy()
matrix = np.asarray(nf["embedding.value"]).reshape(len(nf), -1)
query_idx = np.where(oids == TARGET_OID)[0][0]
distances = cdist(matrix[query_idx : query_idx + 1], matrix, metric="cosine")[0]
distances[query_idx] = np.inf # exclude the query itself
best_idx = np.argmin(distances)
best_oid = oids[best_idx]
print(f"Query: OID {TARGET_OID}")
print(f"Nearest neighbour: OID {best_oid}, cosine distance {distances[best_idx]:.6f}")
# Nearest neighbour: OID 680113300005170, cosine distance 0.000046
assert best_oid == 680113300005170
assert distances[best_idx] < 0.001
Query: OID 680213300009232
Nearest neighbour: OID 680113300005170, cosine distance 0.000046
The nearest neighbour is OID 680113300005170 — the same physical object
(HZ Her / Her X-1, an X-ray binary) observed in the
g-band, recovered automatically from an r-band query through embedding similarity.
See both objects on SNAD Viewer: query (r-band) · nearest neighbour (g-band)
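To inspect more than one match, the single argmin can be swapped for a sort. The sketch below runs on random toy vectors with numpy only; `top_k_cosine` is a hypothetical helper, not part of light-curve, and the 8-dimensional vectors stand in for real embeddings.

```python
import numpy as np


def top_k_cosine(matrix, query_idx, k=5):
    """Return (indices, distances) of the k rows closest to row query_idx.

    Hypothetical helper for illustration; assumes no all-zero rows.
    """
    # Normalise rows so that a dot product equals cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit[query_idx]
    distances[query_idx] = np.inf        # exclude the query itself
    order = np.argsort(distances)[:k]    # ascending: closest first
    return order, distances[order]


rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 8))          # 100 made-up 8-dim "embeddings"
# Plant a rescaled, slightly perturbed copy of row 7 at row 42.
toy[42] = 3.0 * toy[7] + 0.001 * rng.normal(size=8)

idx, dist = top_k_cosine(toy, query_idx=7, k=3)
print(idx, dist)
```

The planted row comes back first despite being three times longer than the query, since only direction enters the distance. For much larger catalogues, `np.argpartition` avoids sorting the full distance array.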
Next steps
- Classification — train a classifier on top of embeddings
- onnxruntime tips — thread control on shared HPC nodes, GPU/CUDA setup
- API reference