Batch processing

This tutorial covers the .many() method for efficient bulk feature extraction:

Plain Python lists of (t, m, sigma) tuples
nested-pandas with real ZTF survey data
PyArrow List<Struct> arrays
Polars Series

All Arrow-compatible inputs skip Python-level iteration and hand their buffers to Rust without copying.

In [1]:

# %pip install light-curve

Plain list of tuples

.many() accepts a list of (t, m, sigma) tuples and returns a 2-D NumPy array of shape (N, n_features). Multi-threading is enabled by default via the n_jobs parameter:

In [2]:

import light_curve as licu
import numpy as np

rng = np.random.default_rng(0)
light_curves = [
    (np.sort(rng.random(50)), rng.random(50), rng.random(50) * 0.1)
    for _ in range(1000)
]

results = licu.Amplitude().many(light_curves)
print(f'Extracted from {len(light_curves)} light curves: shape = {results.shape}')
print(f'Mean amplitude = {results.mean():.4f} mag')

Extracted from 1000 light curves: shape = (1000, 1)
Mean amplitude = 0.4806 mag
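
As a sanity check on the printed mean, the dependency-free sketch below reproduces it approximately. It assumes that Amplitude is the half-amplitude, (max(m) - min(m)) / 2, as the light-curve documentation describes it. The helper `half_amplitude` is a hypothetical stand-in, not the library's implementation, and Python's `random` module replaces the NumPy generator, so individual values differ. For n independent uniform samples on [0, 1) the expected range is (n - 1) / (n + 1), so with n = 50 the expected half-amplitude is 49 / 102 ≈ 0.4804, close to the 0.4806 printed above.

```python
import random


def half_amplitude(m):
    """Hypothetical stand-in for Amplitude: half of the peak-to-peak range."""
    return (max(m) - min(m)) / 2


n_obs, n_lc = 50, 1000
random.seed(0)

# One row per light curve, one column per feature: the (N, n_features) layout.
features = [
    [half_amplitude([random.random() for _ in range(n_obs)])]
    for _ in range(n_lc)
]

# Analytic expectation for uniform magnitudes versus the simulated mean.
expected = (n_obs - 1) / (n_obs + 1) / 2
simulated = sum(row[0] for row in features) / n_lc
print(f"shape = ({len(features)}, {len(features[0])})")
print(f"expected = {expected:.4f}, simulated = {simulated:.4f}")
```

The nested-list layout mirrors the (N, n_features) result of .many(): one row per light curve and, for a single-valued feature such as Amplitude, one column.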

nested-pandas with ZTF survey data

nested-pandas extends pandas with nested Arrow column support, useful for catalog data such as ZTF or Rubin LSST.

In [3]:

# %pip install light-curve nested-pandas s3fs universal-pathlib

In [4]:

import light_curve as licu
import nested_pandas as npd
import numpy as np
import pyarrow as pa
from upath import UPath

s3_path = UPath(
    "s3://ipac-irsa-ztf/contributed/dr23/lc/hats/ztf_dr23_lc-hats/dataset/Norder=6/Dir=30000/Npix=34623/part0.snappy.parquet",
    anon=True,
)
ndf = npd.read_parquet(
    s3_path,
    columns=["objectid", "lightcurve.hmjd", "lightcurve.mag", "lightcurve.magerr"],
)

ndf = ndf.loc[ndf["lightcurve"].list_lengths > 10]

ndf["lightcurve.t"] = np.asarray(ndf["lightcurve.hmjd"] - 58000, dtype=np.float32)

feature = licu.Extractor(licu.Chi2Pvar(), licu.InterPercentileRange(quantile=0.25), licu.LinearFit())
result = feature.many(pa.array(ndf["lightcurve"]), n_jobs=-1,
                      arrow_fields={"t": "t", "m": "mag", "sigma": "magerr"})

ndf = ndf.assign(**dict(zip(feature.names, result.T)))
ndf.head()

Out[4]:

(index)	objectid	lightcurve	chi2_pvar	inter_percentile_range_25	linear_fit_slope	linear_fit_slope_sigma	linear_fit_reduced_chi2
0	1248202100002710	[{hmjd: 59176.3144, mag: 19.841591, magerr: 0.140125, t: 1176.314453}; +10 rows]	0.000000	0.305351	0.001163	0.000185	5.509358
1	1248202100002733	[{hmjd: 59176.31488, mag: 19.836166, magerr: 0.139713, t: 1176.314819}; +15 rows]	0.204448	0.216278	0.000245	0.000145	1.167804
2	1248202100002739	[{hmjd: 59124.43823, mag: 19.081635, magerr: 0.086184, t: 1124.438232}; +22 rows]	0.734284	0.092379	0.000048	0.000083	0.818141
5	1248202100002819	[{hmjd: 59124.43823, mag: 17.424206, magerr: 0.029763, t: 1124.438232}; +25 rows]	0.000120	0.042589	0.000031	0.000028	2.430087
8	1248202100002918	[{hmjd: 59124.43823, mag: 18.022591, magerr: 0.041814, t: 1124.438232}; +25 rows]	0.008479	0.045643	0.000043	0.000040	1.825333

5 rows x 7 columns
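
The cell above subtracts 58000 from hmjd before casting the result to float32. The tutorial does not say why. A plausible reason, offered here as an assumption, is precision: float32 has a 24-bit significand, so the gap between representable values grows with magnitude. The standard-library sketch below compares both cases for the first epoch shown in the table. The helper names `to_f32` and `f32_spacing` are hypothetical.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_spacing(x):
    """Gap between adjacent float32 values near x (24-bit significand)."""
    _, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= mantissa < 1
    return 2.0 ** (exponent - 24)


hmjd = 59176.3144  # first epoch shown in the table above
t = hmjd - 58000   # offset applied in float64, before the cast

for label, value in [("hmjd", hmjd), ("hmjd - 58000", t)]:
    step_s = f32_spacing(value) * 86400  # days -> seconds
    print(f"{label:>12}: float32 = {to_f32(value):.6f}, step = {step_s:.1f} s")
```

For this epoch the float32 step is 337.5 s on the raw hmjd value and about 10.5 s after the offset. The shifted value rounds to 1176.314453, which matches the t column in the table.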

PyArrow

PyArrow is the reference Python implementation of Apache Arrow. Pass a List<Struct<t, m, band>> array directly to .many() for multiband extraction. This example omits the sigma field, so the features are computed without uncertainties.

In [5]:

# %pip install light-curve pyarrow

In [6]:

import light_curve as licu
import numpy as np
import pyarrow as pa

BANDS = ["g", "r"]
rng = np.random.default_rng(42)
n_lc, n_per_band = 200, 40

struct_type = pa.struct([
    ("t", pa.float64()),
    ("m", pa.float64()),
    ("band", pa.string()),
])


def make_lc():
    rows = []
    for b in BANDS:
        t = rng.uniform(0, 100, n_per_band)
        m = rng.normal(15.0 if b == "g" else 15.3, 0.3, n_per_band)
        rows.extend({"t": float(ti), "m": float(mi), "band": b} for ti, mi in zip(t, m))
    rows.sort(key=lambda r: r["t"])
    return rows


lcs_arrow = pa.array([make_lc() for _ in range(n_lc)], type=pa.list_(struct_type))

feature = licu.Extractor(
    licu.InterPercentileRange(quantile=0.1, bands=BANDS),  # robust amplitude per band
    licu.AndersonDarlingNormal(bands=BANDS),  # normality test per band
    licu.ColorOfMaximum(BANDS),  # colour at brightness peak
    licu.ColorOfMinimum(BANDS),  # colour at brightness trough
)
result = feature.many(
    lcs_arrow,
    sorted=True,
    arrow_fields={"t": "t", "m": "m", "band": "band"},
)
print(f"shape: {result.shape}")  # (200, 6)
print("names:", feature.names)

shape: (200, 6)
names: ['inter_percentile_range_10_g', 'inter_percentile_range_10_r', 'anderson_darling_normal_g', 'anderson_darling_normal_r', 'color_max_g_r', 'color_min_g_r']

Polars

Polars is a fast DataFrame library built on Arrow. Group a flat multiband DataFrame by object and pass the nested Series to .many().

In [7]:

# %pip install light-curve polars

In [8]:

import light_curve as licu
import numpy as np
import polars as pl

BANDS = ["g", "r"]
rng = np.random.default_rng(42)
n_obj, n_per_band = 200, 40

object_id = np.repeat(np.arange(n_obj), n_per_band * len(BANDS))
band_col = np.tile(np.repeat(BANDS, n_per_band), n_obj)
t = np.sort(rng.uniform(0, 100, n_obj * n_per_band * len(BANDS)))
m = rng.normal(15.0, 0.3, len(object_id))
sigma = rng.uniform(0.01, 0.1, len(object_id))

df = pl.DataFrame({"object_id": object_id, "band": band_col, "t": t, "m": m, "sigma": sigma})
nested = df.group_by("object_id").agg(pl.struct("t", "m", "sigma", "band").alias("lc"))

feature = licu.Extractor(
    licu.ExcessVariance(bands=BANDS),  # variability excess over noise per band
    licu.StetsonK(bands=BANDS),  # variability index per band
    licu.BeyondNStd(nstd=1.5, bands=BANDS),  # outlier fraction per band
    licu.ColorOfMedian(BANDS),  # colour at median brightness
    licu.ColorSpread(BANDS),  # std dev of per-band means
)
result = feature.many(
    nested["lc"],
    arrow_fields={"t": "t", "m": "m", "sigma": "sigma", "band": "band"},
)
nested = nested.with_columns(
    [pl.Series(name, result[:, i]) for i, name in enumerate(feature.names)]
)
nested.select(["object_id"] + feature.names)

Out[8]:

shape: (200, 9)

object_id	excess_variance_g	excess_variance_r	stetson_K_g	stetson_K_r	beyond_2_std_g	beyond_2_std_r	color_median_g_r	color_spread
i64	f64	f64	f64	f64	f64	f64	f64	f64
42	0.000477	0.000423	0.725276	0.685257	0.075	0.075	-0.030723	0.008347
161	0.000328	0.000332	0.713015	0.773202	0.125	0.05	-0.036714	0.042266
191	0.000426	0.000458	0.686024	0.678284	0.125	0.1	0.243003	0.136195
188	0.000251	0.000381	0.799913	0.628956	0.1	0.1	-0.016024	0.023173
197	0.000271	0.000462	0.692505	0.657601	0.1	0.125	-0.029821	0.029909
…	…	…	…	…	…	…	…	…
83	0.000351	0.000289	0.61676	0.738528	0.15	0.175	-0.05787	0.084151
175	0.000302	0.000372	0.624619	0.613073	0.075	0.15	-0.117935	0.060557
44	0.00048	0.00041	0.633319	0.74991	0.15	0.05	0.108399	0.005772
95	0.000245	0.000263	0.705973	0.784646	0.075	0.15	0.02017	0.026593
47	0.000514	0.000542	0.618895	0.710341	0.125	0.075	-0.049602	0.062047
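
Conceptually, group_by("object_id").agg(pl.struct(...)) turns a flat table with one row per observation into one list of structs per object. The pure-Python sketch below mimics that reshaping on a few made-up rows. It only illustrates the data layout and is not how Polars or light-curve implement it.

```python
from collections import defaultdict

# Flat "long" table: one row per observation (object_id, band, t, m, sigma).
# All values are made up for illustration.
rows = [
    (0, "g", 1.0, 15.1, 0.02),
    (1, "g", 1.5, 14.9, 0.03),
    (0, "r", 2.0, 15.4, 0.02),
    (1, "r", 2.5, 15.2, 0.03),
    (0, "g", 3.0, 15.0, 0.02),
]

# Nested layout: one list of observation structs per object.
# Rows keep their original (time-sorted) order within each object.
nested = defaultdict(list)
for object_id, band, t, m, sigma in rows:
    nested[object_id].append({"t": t, "m": m, "sigma": sigma, "band": band})

for object_id, lc in nested.items():
    print(object_id, [(obs["t"], obs["band"]) for obs in lc])
```

Polars' group_by does not preserve the order of groups unless maintain_order=True is passed, which is why object_id appears shuffled in the output above. The features still line up with their objects because they are attached to the already grouped frame.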

Next steps

Feature basics tutorial — single features, Extractor, multiband intro
Multiband tutorial — per-band and cross-band features
Periodogram tutorial — Lomb–Scargle and period search
API reference — full signatures and equations