Build A Small Text-Based Training Dataset

At the time of writing this, my TIL repo has 1749 TILs that I've written by hand over the course of about 11 years. As I started writing my own naive Byte Pair Encoding implementation, I realized I needed a sizeable and interesting chunk of text to run through it. The aggregate of all my TILs seemed like a great candidate, but I needed a way to bundle them up into a single file.

Here is a one-liner I can run from the root of the TIL repo to do just that:

{ cat README.md; find */ -name '*.md' -print0 | sort -z | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; } > combined.md

And here is a formatted version of it so that we can see what is going on:

{
  cat README.md; \
  find */ -name '*.md' -print0 \
  | sort -z \
  | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; \
} > combined.md

Everything inside the curlies ({ ... }) runs as a single group, and the combined output of that group is redirected (>) to combined.md.
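To see the grouping on its own, here is a minimal sketch that runs in a throwaway directory. The file name grouped.txt and the echoed strings are made up for illustration.

```shell
# Work in a temporary directory so nothing real gets overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Both commands run in the current shell and share a single redirect.
# Note the required space after "{" and the ";" before "}".
{ echo "first"; echo "second"; } > grouped.txt

cat grouped.txt
```

Without the curlies, a trailing > grouped.txt would only capture the output of the last command.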

Inside the curlies, I first output the contents of README.md with cat. I'm not sure how much value the README.md is adding to my corpus, but I'll keep it for now.

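Then I find all markdown files that are nested in a categorical directory. Because find starts from */, it only searches the directories at the root of the repo, so the README.md I already output does not get picked up a second time. The -print0 is a nice trick to separate each filename with a null byte (\0), which keeps filenames containing spaces or newlines intact. This isn't strictly necessary since all of my filenames are well-formed with alphanumeric characters and dashes.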

Those filenames get piped to sort -z which sorts them alphabetically. The -z option maintains the null byte as the separator.
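Here is a small sketch of why the null byte matters, using a made-up notes directory with one filename that contains a space. It assumes a sort that supports -z (GNU and BSD sort both do, though the option is not part of POSIX).

```shell
# Throwaway directory with two hypothetical files, one with a space in its name.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
mkdir notes
touch "notes/b file.md" "notes/a-file.md"

# Names stay null-separated through sort; tr swaps the null bytes for
# newlines at the very end so the result is readable.
sorted=$(find notes -name '*.md' -print0 | sort -z | tr '\0' '\n')
printf '%s\n' "$sorted"
```

Each name comes out whole and in sorted order, including the one with the space.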

The sorted filenames are then piped to an xargs -0 statement which is going to process the filenames one by one, recognizing the null byte separator. The -I{} is how I tell xargs to replace the occurrence of {} at the end of the line with the current filename. That filename then becomes the first positional argument ($1) to the inlined shell script; the _ in front of it is a placeholder that fills $0. That shell script does two things:

  1. it prints out the file separator that I want to use (<|endoftext|>)
  2. it outputs the contents of the current file. Remember, I'm outputting everything so that it can all get redirected into the combined.md file.
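The argument handling can be tried in isolation. This is a sketch with made-up filenames (git/example.md, a.md, b.md) that echoes the name instead of running cat, so no files need to exist.

```shell
# "_" lands in $0, so the value after it is the first positional argument, $1.
params=$(sh -c 'echo "zero=$0 one=$1"' _ git/example.md)
echo "$params"

# Two null-separated names; xargs runs the inline script once per name,
# substituting the name for {}.
listing=$(printf 'a.md\0b.md\0' | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; echo "file: $1"' _ {})
echo "$listing"
```

Passing the filename as an argument rather than splicing {} directly into the script text means the shell never re-parses the filename as code.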

When I run it, I get a 2MB file that contains a ton of words and code samples I've written over the years.
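One way to sanity-check the result is to run the same one-liner against a tiny stand-in repo and count the separators. The directory names and file contents below are invented for the sketch; they are not from my TIL repo.

```shell
# Build a miniature repo layout: a README plus two category directories.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
mkdir git vim
printf '# TIL\n' > README.md
printf 'git tip\n' > git/one.md
printf 'vim tip\n' > vim/two.md

# The same one-liner as above.
{ cat README.md; find */ -name '*.md' -print0 | sort -z | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; } > combined.md

cat combined.md

# One separator line per TIL file, so this should match the file count.
grep -c '<|endoftext|>' combined.md

# Size of the combined file in bytes.
wc -c < combined.md
```

On the real repo, the separator count will be off if any TIL happens to contain the separator string itself.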
