TIL about tr command
Once you are introduced to small self-contained programs in Unix that do one thing well, the pleasure of creatively piping them to perform file surgery never ceases. In this TIL, I share a recent scenario that I dealt with to highlight this pleasure.
Imagine you have a list of files that contain gene names in a particular column but some of those gene names happen to be in the same row. What I mean is this:
chr1 37956834 37957043|Lyg1
chr1 40790937 40791146|Mfsd9;Mfsd9
chr1 95375807 95376016|
chr1 121315462 121315671|Insig2;Insig2;Insig2;Insig2;Insig2
chr1 121970294 121970503|
chr1 121970312 121970521|
chr1 132233726 132233935|Lemd1;Lemd1;Lemd1;Lemd1;Lemd1;Lemd1
chr1 133424315 133424524|Sox13
chr1 186692087 186692296|Tgfb2;Tgfb2
chr1 189731906 189732115|Ptpn14;Ptpn14;Ptpn14
chr1 191965279 191965488|
chr10 8886252 8886461|Sash1;Sash1
chr10 12684207 12684416|Utrn;Utrn
chr10 22212552 22212761|H60b;Raet1a;Raet1c
chr10 45298649 45298858|Popdc3;Popdc3
chr10 61535129 61535338|Lrrc20
chr10 69452106 69452315|AK029407
chr10 75979710 75979919|Gm5134;Gm5134;Gm5134;Gm5134
chr10 77290910 77291119|Adarb1;Adarb1;Adarb1;Adarb1;Adarb1;Adarb1
chr10 79892887 79893096|Bc1;Prtn3;Prtn3;Elane;Cfd;Cfd;Med16;Med16;Med16
Notice the gene names after the |
symbol. What if I wanted to capture all unique gene names in multiple files with hundreds of rows like this? Like this:
Lyg1
Mfsd9
Insig2
Lemd1
Sox13
Tgfb2
Ptpn14
Sash1
Utrn
H60b
Raet1a
Raet1c
Without Unix’s default commands, process will be heavily manual with no pleasure. So, let’s say I had n such files unstruct-1, unstruct2….unstruct-n and I want to store the unique gene names from them into genes-unstruct-1, genes-unstruct-2…genes-unstruct-2. This is what I would do:
for f in ./unstruct*;
do (cut -d "|" -f2 $f | tr -s ";" "\n" | uniq > genes-${f});
done
That’s it! Here’s what happened:
- A for loop was executed over all the files that start with a particular string. The
*
is for all files in the current directory that start with a particular string. - We start with the cut command and tell it to cut the file with
|
as the delimiter. This helps us use-f2
as gene list is now our second column. If instead of|
, it was a tab character, we wouldn’t have required-d
argument and would have used-f4
(for 4th column). $f
is our input file. Since we used “f” as the variable to represent a file in our for loop, we simply add a$
sign in front of it to access that file.- The output from this first cut command goes into another command tr with the help from unix pipe. tr simply transforms your file with rows becoming columns and vice versa. We wanted all our genes to be in a single column instead of them being in row with
;
as a separator. tr has a usefu lparameter-s
that allows us to convert;
into a new line character\n
. So while transforming, it moves the gene into the next line. - Our final command uniq simply removes the duplicates from the gene list.
- And then we throw the output of this whole pipeline into a file that derives it’s name from the original file.
${f}
allows us to use the name of the original file as part of another string. We don’t have to worry about naming our output files differently as this automates the process.
It will depend on the size and the count of your original files but if you haven’t played around with a combo of Unix commands before, you will be happily surprise with the speed.