Build A Small Text-Based Training Dataset
At the time of writing, my TIL repo has 1749 TILs that I've written by hand over the course of about 11 years. As I started writing my own naive Byte Pair Encoding implementation, I realized I needed a sizeable and interesting chunk of text to run through it. The aggregate of all my TILs seemed like a great candidate, but I needed a way to bundle them up into a single file.
Here is a one-liner I can run from the root of the TIL repo to do just that:
{ cat README.md; find */ -name '*.md' -print0 | sort -z | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; } > combined.md
And here is a formatted version of it so that we can see what is going on:
{
  cat README.md; \
  find */ -name '*.md' -print0 \
    | sort -z \
    | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; \
} > combined.md
Everything inside the curlies ({ ... }) gets evaluated and the results are
redirected (>) to combined.md.
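
As a minimal sketch of that grouping behavior on its own (the scratch directory and the file name grouped.txt are made up for illustration), two commands inside one pair of curlies share a single redirect:

```shell
# Scratch directory so nothing in a real repo is touched
tmp=$(mktemp -d)

# Both commands run as one group; the single redirect captures
# the output of each of them, in order
{ echo "first"; echo "second"; } > "$tmp/grouped.txt"

cat "$tmp/grouped.txt"
```

Note the space after the opening curly and the semicolon before the closing one. The shell needs both to parse the group.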
Inside the curlies, I first output the contents of README.md with cat. I'm
not sure how much value the README.md is adding to my corpus, but I'll keep
it for now.
Then I find all markdown files that are nested in a category directory. The
-print0 is a nice trick that terminates each filename with a null byte (\0)
instead of a newline, so names containing spaces or newlines survive the
pipeline intact. This isn't strictly necessary here since all of my filenames
are well-formed with alphanumeric characters and dashes.
Those filenames get piped to sort -z which sorts them alphabetically. The -z
option maintains the null byte as the separator.
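
To see why the null byte matters, here is a small sketch on a throwaway directory tree. The directory and file names are hypothetical, and one of them deliberately contains a space, which my real filenames never do:

```shell
# Scratch "repo" with two category directories (hypothetical names)
tmp=$(mktemp -d)
mkdir "$tmp/git" "$tmp/vim"
printf 'from git\n' > "$tmp/git/amend.md"
printf 'from vim\n' > "$tmp/vim/two words.md"   # note the space

# Run in a subshell so the cd does not leak into the calling shell
(
  cd "$tmp" || exit 1
  # Null-terminated names keep "two words.md" as a single argument,
  # and sort -z orders the names using the same null separator
  find */ -name '*.md' -print0 | sort -z | xargs -0 cat > sorted.txt
)

cat "$tmp/sorted.txt"
```

The git file comes out before the vim file because the paths are sorted as whole strings, directory name first.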
The sorted filenames are then piped to an xargs -0 statement which is going to
process the filenames one-by-one, recognizing the null byte separator. The -I{}
is how I tell xargs to replace the occurrence of {} at the end of the line
with the current filename. That filename then becomes the first positional
argument ($1) to the inlined shell script; the _ in front of it is a
placeholder that fills $0. That shell script does two things:
- it prints out the file separator that I want to use (<|endoftext|>)
- it outputs the contents of the current file

Remember, I'm outputting everything so that it can all get redirected into the
combined.md file.
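
The way the arguments line up inside sh -c is easy to get wrong, so here is a small sketch that prints them. The inputs alpha and beta are stand-ins for filenames, and the zero=/one= labels are only there for illustration:

```shell
# Feed two null-terminated items to xargs; each one is substituted
# for {} and handed to the inline script as an argument
out=$(printf 'alpha\0beta\0' \
  | xargs -0 -I{} sh -c 'echo "zero=$0 one=$1"' _ {})

# One line per item: the underscore lands in $0, the item in $1
echo "$out"
```

Passing the filename as an argument, rather than writing {} directly inside the quoted script, means the name is never parsed as shell code, which is the safer habit if a filename ever contains quotes or other special characters.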
When I run it, I get a 2MB file that contains a ton of words and code samples I've written over the years.
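
A quick sanity check I could run afterwards is to compare the number of separators against the number of nested markdown files. This is a sketch against a tiny throwaway repo with made-up files, and it assumes that no TIL contains the literal separator text in its body:

```shell
# Tiny stand-in for the TIL repo (hypothetical contents)
tmp=$(mktemp -d)
mkdir "$tmp/git" "$tmp/vim"
printf '# TIL\n' > "$tmp/README.md"
printf 'Amend a commit\n' > "$tmp/git/amend.md"
printf 'Yank a line\n' > "$tmp/vim/yank.md"

# Same one-liner as above, run from the root of the stand-in repo
(
  cd "$tmp" || exit 1
  { cat README.md; find */ -name '*.md' -print0 | sort -z | xargs -0 -I{} sh -c 'echo "<|endoftext|>"; cat "$1"' _ {}; } > combined.md
)

# Count nested markdown files and separator lines; they should match
files=$(find "$tmp"/*/ -name '*.md' | wc -l | tr -d ' ')
seps=$(grep -cF '<|endoftext|>' "$tmp/combined.md")
echo "files=$files separators=$seps"
```

If the two numbers disagree on the real repo, either a file was skipped or one of the TILs mentions the separator itself.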