xan

médialab Sciences Po · medialab.xan

The CSV magician

xan is a command line tool for processing CSV files directly from the shell. It is written in Rust to be as fast and memory-efficient as possible, and it easily handles large CSV files (gigabytes). It leverages a novel SIMD CSV parser and can parallelize some computations through multithreading, so that some tasks complete as fast as your computer allows.

It can preview, filter, slice, aggregate, sort and join CSV files, and exposes a large collection of composable commands that can be chained together to perform a wide variety of typical tasks.

xan also offers its own expression language for complex tasks that the simplest commands cannot handle. This minimalistic language has been tailored for CSV data and is much faster than evaluating typical dynamically-typed languages such as Python, Lua or JavaScript.

Note that this tool is originally a fork of BurntSushi's xsv, but has been nearly entirely rewritten at this point to fit the use-cases of SciencesPo's médialab, rooted in web data collection and analysis geared towards social sciences (you might think CSV is outdated by now, but read our love letter to the format before judging too quickly). xan therefore goes beyond typical data manipulation and exposes utilities related to lexicometry, graph theory and even scraping.

Beyond CSV data, xan can process a large variety of CSV-adjacent data formats from many different disciplines, such as web archival (.cdx) or bioinformatics (.vcf, .gtf, .sam, .bed etc.). It can also convert to and from many data formats such as JSON, Excel files and numpy arrays, using xan to and xan from.

Finally, xan can display CSV files in the terminal for easy exploration, and can even draw basic data visualisations.

winget install --id medialab.xan --exact --source winget

Latest 0.58.0

Release Notes

Breaking

  • No longer serializing moonblade lists either as joined by some separator or as JSON. This was awkward, error-prone & potentially lossy. Use the join function manually to format output when required.
  • As per previous point, dropping xan scrape --sep.
  • Dropping implicit unary function calls in moonblade pipelines. This feature was not well-known, confusing (an identifier could be understood as a call in a pipeline, but only if not in first position...), and mostly useless now that moonblade has a proper dot operator.
  • xan plot -A/--aggregate does not take an expression anymore but has an automatic selection of two modes: sum and mean. It should also be faster.
  • Renaming the index function as row_index for clarity.
  • xan agg -C/--along-columns & -M/--along-matrix & xan groupby -C/--along-columns & -M/--along-matrix will no longer map the current column index to the result of the index() function. The col_index() function can now be used instead for this very purpose.
  • xan window -g/--groupby does not require the file to be sorted anymore. This means using -g/--groupby will now require the whole file to be buffered into memory by the command. The old behavior can still be used through the -S/--sorted flag, thus aligning the xan window command with the rest of the tool.
  • row_index will now error if the expression has no concept of row index, instead of returning nothing.
  • xan parallel -z/--compress now takes the desired compression (either gzip or zstd).
  • Retiring the xan grep command in favor of xan search -Z/--fast-parser.
  • xan tokenize --keep short flag becomes -k instead of -K to harmonize with other commands.
  • Retiring the xan flatmap command in favor of xan explode -e.
  • Retiring the xan fuzzy-join command in favor of a consolidated xan join command.
  • Changing xan from -f txt -c default to line instead of value.
  • Renaming xan join -L/--prefix-left & -R/--prefix-right short flags to -l & -r respectively to avoid colliding with the added -R/--reverse flag that can be used for merge joins.
  • Dropping xan plot -B/--bars. It never worked very well and its use-case will be redirected to xan spark.
  • Changing xan heatmap --width short flag from -w to -W so that adding a -H/--height flag remains consistent and avoids clashing with -h/--help.
  • Dropping xan heatmap --show-gradients in favor of xan help gradients.
  • Renaming xan search -A/--all flag to --every-column for clarity and to avoid clashing with -A/--after-context.
  • Dropping xan sort -U/--unstable. It was never used and the performance boost it supposedly provides cannot be observed.

Features

  • Adding xan parallel --dont-chunk.
  • Adding nullary col, col_index & header variants, to work with expressions applied in series to multiple columns at once.
  • Adding prev_col & next_col functions.
  • Adding xan (search|filter) -B/--before-context & -A/--after-context.
  • Adding xan window -O/--overwrite.
  • Adding xan map -C/--along-columns.
  • Adding xan window -C/--along-columns.
  • Adding xan cat rows --raw, -P/--preprocess & -H/--shell-preprocess.
  • Improving xan select DSL star selectors. You can now do stuff like vec_*count, *[1], vec*[1] etc.
  • xan p -H/--shell-preprocess now works on Windows.
  • Adding native zsh completions (@apcamargo).
  • Adding xan dedup --u32.
  • Adding xan explode -e/--evaluate, -f/--evaluate-file, --pad & -k/--keep.
  • xan to npy is now able to stream.
  • Adding xan parallel top & xan top -p/--parallel, -t/--threads.
  • Adding xan network edgelist --range.
  • Adding xan network nodelist.
  • Adding the xan run command.
  • Adding xan view --name.
  • Adding xan join -S/--sorted, -R/--reverse & -N/--numeric.
  • Adding xan parallel --run & xan cat rows --run.
  • Adding xan to md -l/--limit.
  • Adding the xan spark command.
  • Adding xan stats -R/--report, --color, --cols, --sep.
  • Adding xan (freq|p freq) -X/--approx-algo.
  • Adding xan plot -D/--density-gradient, --density-scale, --hide-legend, --hide-x-axis, --hide-y-axis, --hide-all & -Q/--square.
  • xan separate will now avoid emitting columns with an empty name given to --into.
  • Adding xan separate --txt & -F/--filter.
  • Adding pow & sqrt scales.

Fixes

  • Fixing issues related to nested lambdas in expressions.
  • Fixing xan rename consistency regarding CRLF newlines and first row normalization when using -n/--no-headers.
  • Fixing xan map --overwrite --filter.
  • Fixing lead window function when there are not enough rows ahead.
  • Fixing xan network --format not being validated early enough.
  • Fixing xan explode -D/--drop-empty when selecting multiple columns.
  • Fixing xan merge -u row precedence.
  • Fixing xan join -D/--drop-key automatic selection when using --full.
  • Fixing granularity inference of xan plot -T.
  • Fixing xan from -f (json|ndjson) to emit empty outputs from empty inputs.
  • Fixing xan headers layout when input files have a very large number of columns (>= 1000).
  • Fixing arity validation of top, argtop, most_common & most_common_counts aggregation functions.

Performance

  • moonblade expressions are now faster overall and allocate more cautiously, thus saving memory.
  • Improving performance of xan transform, xan flatmap, xan agg & xan groupby.
  • Improving performance of xan rename.
  • Faster xan range.
  • Faster xan parallel -H/--shell-preprocess.
  • Faster xan tokenize words.
  • Adding fast path for xan explode when only a single column is selected.
  • Faster xan sort -e.

Quality of Life

  • xan plot will now display labels in legends.
  • xan cat rows will now error when inputs have inconsistent columns.
  • Automatic column alignment with xan to md.
  • xan from now considers .log files as text lines.
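
One breaking change listed above, xan window -g/--groupby no longer requiring sorted input, trades memory for convenience. The Python sketch below illustrates that general trade-off only; it is not xan's implementation, and the function names and the per-group sum are hypothetical stand-ins for a window computation. A streaming pass over sorted input holds one group at a time, while accepting unsorted input means buffering every group first.

```python
from collections import defaultdict
from itertools import groupby


def group_sums_sorted(rows):
    """Stream over (key, value) rows assumed sorted by key.

    Only the current group is held at a time, but rows sharing a key
    must be contiguous, otherwise the key is emitted more than once.
    """
    for key, group in groupby(rows, key=lambda row: row[0]):
        yield key, sum(value for _, value in group)


def group_sums_buffered(rows):
    """Accept (key, value) rows in any order.

    Every group is kept in memory until the input is exhausted.
    """
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    for key, values in groups.items():
        yield key, sum(values)
```

On unsorted input the streaming variant silently splits a group into several pieces, which is why a flag such as -S/--sorted only makes sense when the file is known to be sorted on the grouping column.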

Installer type: zip

Architecture Scope Download SHA256
x64 Download 1390BD9202799FD7469BB9163039998AD7F96BC4ED3CE2BF2BB8550EC3FC167F
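
The SHA256 value above can be used to check a manually downloaded archive before unpacking it. A minimal Python sketch, assuming the zip has been saved locally (the file name used in the usage note is hypothetical, and the helper names are illustrative):

```python
import hashlib

# Checksum listed above for the 0.58.0 x64 zip.
EXPECTED_SHA256 = "1390BD9202799FD7469BB9163039998AD7F96BC4ED3CE2BF2BB8550EC3FC167F"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the uppercase hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Compare a file's digest with the published one, ignoring case."""
    return sha256_of_file(path) == expected.upper()
```

For example, matches_published_hash("xan.zip") returns True only if the local file is byte-for-byte identical to the archive whose hash is listed. Installing through winget performs its own hash validation, so this is mainly useful for direct downloads.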

Details

Homepage
https://github.com/medialab/xan
License
Unlicense, MIT
Publisher
médialab Sciences Po
Support
https://github.com/medialab/xan/issues
Copyright
Copyright (c) 2015-2026 Andrew Gallant, Guillaume Plique

Tags

csv, data, datascience, diagram, graph, graphics, plot, statistics, stats

Older versions (7)

0.57.1
Architecture Scope Download SHA256
x64 Download D2539B421CEA8CAB24BAF5B70CAD3CB299B4997EC38384336C0779CD0A4E7CAC
0.57.0
Architecture Scope Download SHA256
x64 Download DC07694298DEA6A777A3C70AED06818FBDD036D5A0FBD38C37027149EFE0A5B6
0.56.0
Architecture Scope Download SHA256
x64 Download 30447BB352627DD70AFB6B9DEC4BC52DA2D27F6A5175DA944E67133C1D183490
0.55.0
Architecture Scope Download SHA256
x64 Download 28766FB75AD0F1A046A9F86D0D81398A484DC88003CB09D33063B6E45F65E00C
0.54.1
Architecture Scope Download SHA256
x64 Download 7259A235660AC837EFCC848AE6A9E6394C767931D7E27E11621AE9F413DB68E7
0.54.0
Architecture Scope Download SHA256
x64 Download 766B9BE6224E7ACD118196014C2562BBD8ABAADD0A414A3FB5F83A32B0B068DC
0.53.0
Architecture Scope Download SHA256
x64 Download 5D168D53F0ED6B87A61EA96FD7AECE7DC53B05D6DEB6D50CFCC3E778DAD340F0