Similarity search¶

Embeddings are dense fixed-length vectors, so cosine distance is a natural proxy for light-curve similarity. The example below reads a ZTF DR23 HATS pixel directly from the public S3 bucket using nested-pandas and universal-pathlib, with s3fs as the S3 backend (install them with pip install nested-pandas universal-pathlib s3fs), embeds all well-observed light curves with Astromer2, and finds the closest neighbour to a given object by cosine distance.

This pixel contains about 50 k objects that pass the quality cuts; embedding them takes roughly 12 minutes on an Apple M2 Pro (~15 ms per object).
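
To budget time for a different pixel or machine, the runtime can be estimated from the per-object cost. The sketch below is only a back-of-the-envelope helper: `estimate_minutes` is a hypothetical name, and it assumes the cost scales linearly with the number of objects, ignoring model loading and S3 download time.

```python
def estimate_minutes(n_objects, ms_per_object):
    """Rough wall-clock estimate, assuming a constant cost per object."""
    return n_objects * ms_per_object / 1000.0 / 60.0


# Figures quoted above: about 50 k objects at about 15 ms each.
minutes = estimate_minutes(50_000, 15.0)
print(f"Estimated embedding time: {minutes:.1f} min")
```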

# %pip install light-curve huggingface_hub onnxruntime nested-pandas universal-pathlib

import nested_pandas as npd
import numpy as np
from scipy.spatial.distance import cdist
from upath import UPath

from light_curve.embed import Astromer2

TARGET_OID = 680213300009232  # ZTF r-band light curve
MIN_OBS = 1000

Step 1 — Load data¶

Read one HATS pixel of ZTF DR23 directly from the public S3 bucket. Keep only observations with clean photometry (catflags == 0), then keep only objects that still have at least 1 000 observations.
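
The two cuts act at different levels: the first on individual observations, the second on whole objects. The toy sketch below illustrates that order with plain NumPy. It assumes that a query on a nested column drops observations rather than objects; the helper `apply_quality_cuts`, the object IDs, and the flag values are all made up for illustration.

```python
import numpy as np

MIN_OBS = 3  # tiny threshold for the toy data; the notebook uses 1000

# Hypothetical toy catalogue: object ID -> per-observation catflags
toy_catflags = {
    101: np.array([0, 0, 0, 0]),      # 4 clean observations
    102: np.array([0, 32768, 0, 1]),  # only 2 clean observations
    103: np.array([0, 0, 1, 0, 0]),   # 4 clean observations
}


def apply_quality_cuts(catflags_by_object, min_obs):
    """Return {object ID: mask of clean observations} for objects that pass."""
    kept = {}
    for oid, flags in catflags_by_object.items():
        clean = flags == 0          # stage 1: per-observation cut
        if clean.sum() >= min_obs:  # stage 2: per-object cut on what is left
            kept[oid] = clean
    return kept


kept = apply_quality_cuts(toy_catflags, MIN_OBS)
print(f"Objects after quality cuts: {sorted(kept)}")
```

Because the count is taken after the flag cut, an object with many observations but few clean ones is dropped.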

nf = npd.read_parquet(
    UPath(
        "s3://ipac-irsa-ztf/contributed/dr23/lc/hats/ztf_dr23_lc-hats"
        "/dataset/Norder=5/Dir=0/Npix=2378/",
        anon=True,
    )
)
nf = nf.query("lightcurve.catflags == 0").query(f"lightcurve.list_lengths >= {MIN_OBS}")
print(f"Objects after quality cuts: {len(nf):,}")

Objects after quality cuts: 49,166

Step 2 — Embed with Astromer2¶

Load the pretrained model and apply it to every light curve with map_rows. Each call returns a 256-dimensional vector; the results are stored in a nested column, embedding.value, and then stacked into a matrix for distance computation.
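
The shape bookkeeping can be checked without the real model. In the sketch below, `fake_embed` is a hypothetical stand-in for the model call; it is assumed to return an array with a leading singleton axis, which is what the use of .squeeze() suggests. The light curves are synthetic.

```python
import numpy as np

EMBED_DIM = 256  # vector size quoted above


def fake_embed(times, mags):
    """Hypothetical stand-in for the model: returns a (1, EMBED_DIM) array."""
    rng = np.random.default_rng(len(times))
    return rng.normal(size=(1, EMBED_DIM))


# Three synthetic light curves of different lengths.
light_curves = [(np.linspace(0, 10, n), np.full(n, 15.0)) for n in (5, 8, 13)]

# squeeze() drops the singleton axis, so each vector has shape (EMBED_DIM,).
vectors = [fake_embed(t, m).squeeze() for t, m in light_curves]

# Stacking gives one row per object, regardless of light-curve length.
matrix = np.stack(vectors)
print(matrix.shape)
```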

model = Astromer2.from_hf(output="mean", reduction="beginning")


def embed_row(hmjd, mag):
    return {"embedding.value": model(hmjd, mag).squeeze()}


nf = nf.map_rows(
    embed_row,
    columns=["lightcurve.hmjd", "lightcurve.mag"],
    row_container="args",
    append_columns=True,
)
print(f"Embedded {len(nf):,} objects.")

Embedded 49,166 objects.

Step 3 — Find nearest neighbour¶

Stack the embeddings into a matrix and compute cosine distances from the query object to all others. Cosine distance is scale-invariant: only the direction of the embedding vector matters, not its magnitude.
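
The scale invariance is easy to verify on small vectors. The sketch below implements the standard definition, one minus the cosine of the angle between the vectors, in plain NumPy; `cosine_distance` is a hypothetical helper written for this illustration, not part of any library used here.

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cos(angle between u and v); undefined for zero-length vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])

d_ab = cosine_distance(a, b)                   # 1 - 10/14
d_scaled = cosine_distance(10.0 * a, 0.5 * b)  # same directions, new lengths
d_self = cosine_distance(a, 7.0 * a)           # same direction as itself

print(f"d(a, b)        = {d_ab:.6f}")
print(f"d(10a, 0.5b)   = {d_scaled:.6f}")
print(f"d(a, 7a)       = {d_self:.6f}")
```

Rescaling either vector leaves the distance unchanged, and two vectors that point in the same direction have distance zero whatever their lengths.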

oids = nf["objectid"].to_numpy()
matrix = np.asarray(nf["embedding.value"]).reshape(len(nf), -1)

query_idx = np.where(oids == TARGET_OID)[0][0]
distances = cdist(matrix[query_idx : query_idx + 1], matrix, metric="cosine")[0]
distances[query_idx] = np.inf  # exclude the query itself

best_idx = np.argmin(distances)
best_oid = oids[best_idx]
print(f"Query:             OID {TARGET_OID}")
print(f"Nearest neighbour: OID {best_oid}, cosine distance {distances[best_idx]:.6f}")
# Nearest neighbour: OID 680113300005170, cosine distance 0.000046

assert best_oid == 680113300005170
assert distances[best_idx] < 0.001

Query:             OID 680213300009232
Nearest neighbour: OID 680113300005170, cosine distance 0.000046

The nearest neighbour is OID 680113300005170: the same physical object (HZ Her / Her X-1, an X-ray binary) observed in the g band. Embedding similarity recovered it automatically from an r-band query.
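
The same search extends to the k closest objects instead of a single match. The sketch below is one possible way to do it, not part of the notebook: `top_k_cosine` is a hypothetical helper, and the data are random vectors with one planted near-duplicate direction so the result can be checked.

```python
import numpy as np


def top_k_cosine(matrix, query_idx, k):
    """Return (indices, distances) of the k rows closest to row query_idx."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit[query_idx]  # cosine distance to every row
    distances[query_idx] = np.inf             # exclude the query itself
    # argpartition finds the k smallest without sorting the whole array.
    candidates = np.argpartition(distances, k - 1)[:k]
    order = candidates[np.argsort(distances[candidates])]
    return order, distances[order]


rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 16))
# Plant row 42 as a rescaled, slightly perturbed copy of row 7.
toy[42] = 3.0 * toy[7] + 1e-3 * rng.normal(size=16)

idx, dist = top_k_cosine(toy, query_idx=7, k=3)
for i, d in zip(idx, dist):
    print(f"row {i}: cosine distance {d:.6f}")
```

Row 42 comes first even though it is three times longer than the query, because only direction enters the distance.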

See both objects on SNAD Viewer: query (r-band) · nearest neighbour (g-band)

Next steps¶

Classification — train a classifier on top of embeddings
onnxruntime tips — thread control on shared HPC nodes, GPU/CUDA setup
API reference