Atul's Perl Scripts

Here are some Perl utilities for bioinformatics folks at CHIP, for use in Linux, MacOS X, or Windows under Cygwin (untested). All require Perl and the DBI::MySql library installed (the latter allows Perl to communicate directly with MySql servers). Both of these are probably already installed under your setup. Use CPAN (type "man cpan" to learn more) to learn how to install the DBI package, if you don't already have them. Please don't distribute these scripts outside of CHIP.


The first, make_go_report.pl finds the GO categories for significant genes of your choosing. The input file has to look something like:

92580_at        basal
92589_at        basal
92592_at        basal
92593_at        basal
92615_at        basal
162358_i_at     tg_basal
162459_f_at     tg_basal
92581_at        tg_dex
92582_at        tg_dex
92587_at        tg_dex
92821_at        tg_dex
93011_at        tg_dex

with two columns, with a single tab in-between. Column one is an affy accession number, and column two is YOUR OWN symbol of what makes this gene significant. For example, in the file above, 92615_at was high in my basal chips, so I labeled it with "basal". It's ok to have a gene in two rows with different labels (e.g. the same probe-set can be in the "basal" group in one row, and the "tg_dex" group in the next row).

To run the command, while in the appropriate directory as this perl script, type

./make_go.pl --array u74v2_a --version 4 < input_file.txt

Use the same symbols for the arrays as unCHIP uses. in addition, you can use something like "--array u74%" where the percent-sign is a wildcard. For version, use 4, as this is the latest edition of the unCHIP tables.

The output looks like this:

BIOLOGICAL PROCESS

** GO:0000002 mitochondrial genome maintenance
** 3 / 4
        100040_at@u74v2_a       Mrpl17; Rpml26; MRP-L26; mitochondrial ribosomal protein L17    basal
        100040_at@u74v2_a       Mrpl17; Rpml26; MRP-L26; mitochondrial ribosomal protein L17    tg
        100040_at@u74v2_a       Mrpl17; Rpml26; MRP-L26; mitochondrial ribosomal protein L17    tg_basal
        93062_at@u74v2_a        Mrpl39; ORF22; Rpml5; MRP-L5; C21orf8; Pred22-pending; mitochondrial ribosomal protein L39      tg
        96296_at@u74v2_a        Mrpl15; Rpml7; MRP-L7; HSPC145; mitochondrial ribosomal protein L15     tg

First, the BIOLOGICAL PROCESSes are listed, then the MOLECULAR FUNCTIONS. The GO id number and name are listed. "3 / 4" means that there are three unique probe-sets implicated from this GO-category in any of your groups (in the above example, 100040_at, 93062_at, and 96296_at), and that there were a total of 4 POSSIBLE probe-sets that could have been implicated, taking into account the composition of the arrays you used (subtracting those probe-sets that match multiple genes; a small detail). Each of the genes is listed first by accession, then a tab, then the symbols for this gene, then a tab, then your symbol for the group this gene was implicated in. This file should be directly loadable into Excel.

The second script, make_go_table.pl takes files in the same exact format and is run the same way, with the same parameters, except returns something like:

molecular function	non_tg	dex	tg	tg_dex	non_tg_dex	basal	non_tg_basal
tg_basal	sum
'0003953'	'NAD(+) nucleosidase'	'0'	'1'	'0'	'0'	'0'	'0'	'0'	'0'	'1'
'0004385'	'guanylate kinase'	'0'	'0'	'1'	'0'	'0'	'0'	'0'	'0'	'1'
'0016836'	'hydro-lyase'	'0'	'0'	'0'	'0'	'0'	'1'	'0'	'0'	'1'
'0008378'	'galactosyltransferase'	'1'	'0'	'0'	'0'	'0'	'0'	'0'	'0'	'1'
...
'0005524'	'ATP binding'	'53'	'57'	'81'	'7'	'4'	'87'	'3'	'29'	'321'
'0003677'	'DNA binding'	'53'	'115'	'95'	'5'	'7'	'57'	'8'	'12'	'352'
'0016740'	'transferase'	'52'	'58'	'115'	'12'	'3'	'99'	'2'	'41'	'382'
588 rows


biological process	non_tg	dex	tg	tg_dex	non_tg_dex	basal	non_tg_basal
tg_basal	sum
'0006561'	'proline biosynthesis'	'1'	'0'	'0'	'0'	'0'	'0'	'0'	'0'	'1'
'0006919'	'caspase activation'	'0'	'0'	'0'	'0'	'0'	'1'	'0'	'0'	'1'
'0008213'	'protein amino acid alkylation'	'0'	'0'	'0'	'0'	'0'	'1'	'0'	'0'
'1'
...
'0006810'	'transport'	'7'	'29'	'64'	'7'	'0'	'49'	'2'	'19'	'177'
'0006355'	'transcription regulation'	'47'	'88'	'86'	'5'	'4'	'50'	'5'	'9'	'294'
433 rows

This is tough to explain, but it's required to be able to create those pretty gene-ontologram like pictures in Excel. IMPORTANT: BEFORE opening this file in Excel, you must search for all apostrophe characters in this output file and replace with nothing (i.e. get rid of the apostrophes). In the file, first comes molecular function, then biological process. The first column is the GO code, and the second is the GO term. Then, you will have a column for every symbol OF YOURS in your input file (remember "basal", etc., from above?). The list is sorted in descending order, as is proper for Excel.

The numbers below each column indicate how many genes were in YOUR input list, associated with each of YOUR symbols, in each GO category. From the above example, there were 7 genes in my list associated with "non_tg" (i.e. "non_tg" was listed in the second column of my input file for those 7 genes), and these seven happened to belong to "transport GO:0006810".

To make this into a graph, in Excel, select the GO term-name in the BOTTOM row, then drag to the right to get all the columns, then drag up as many rows as you want (typically you want the bottom 10-15 rows, as the rest of the GO categories only have 1 or 2 genes in them). Then, still in Excel, click on the Chart Wizard, select Stacked Column, click Next, select Series in Rows, then click Finish. Make the chart bigger, then select a smaller font size to see as many categories as you can. As you move your mouse over each square, a little tool-tip will label the category for you. This description doesn't give this enough justice. See the attached file temp_go.gif for an example of what this looks like when you're done.


which_chip.pl --column <col> <inputfile1> <inputfile2> ...

<col> means the tab-delimited column holding the relevant accession number, so if the second column in your file has the probe accession numbers, put 2.

This script outputs two items separated by a tab, similar to

25      Affymetrix u95_a
where the first item is a numeric identifier that YOU WILL NEED for the remaining scripts in this e-mail, and the second item is a human readable explanation of which chip is being used. Mind you, this is no longer Affymetrix specific: Clontech filters, Agilent arrays, and plain Genbank and Unigene identifiers are also supported. In total, there are currently 81 identifier name-spaces installed.

filter_ambiguous_acc.pl --column <col> --array <id> <inputfile1> <inputfile2> ...

<col> again means the tab-delimited column holding the relevant accession number, so if the second column in your file has the probe accession numbers, put 2. <id> holds the array identifier output from the which_chip script above (e.g. "25").

This script outputs all the rows from all the input files, EXCEPT for those rows that list probe sets that may point to more than one locuslink identifier. The best example I can give you is probe set "31350_at", which references a piece of DNA that contains genes with locuslink identifiers 26532 and 26538. There are sometimes ERRORS in these references, but I usually use a routine like this to eliminate the ambigous 5% of probe sets.

filter_non_ll.pl --column <col> --array <id> <inputfile1> <inputfile2> ...

<col> again means the tab-delimited column holding the relevant accession number, so if the second column in your file has the probe accession numbers, put 2. <id> holds the array identifier output from the which_chip script above (e.g. "25").

This script outputs all the rows from all the input files, EXCEPT for those rows for which a locuslink identifier cannot be found. An example of this is "31351_at", which references GenBank X63966. This sequence does match to unigene cluster Hs.135631, but this is not (yet) linked to a locuslink entry, and thus would not be output. There are sometimes ERRORS in these references, but as time goes on, these will improve.

lookup_accession.pl --column <col> --array <id> <inputfile1> <inputfile2> ...

<col> again means the tab-delimited column holding the relevant accession number, so if the second column in your file has the probe accession numbers, put 2. <id> holds the array identifier output from the which_chip script above (e.g. "25").

This script outputs each row from all the input files, and adds a tab-separated column at the start of each line containing the locuslink identifier for that probe set. If more than one is found, they are separated by semicolons.

locus_to_symb.pl --column <col> <inputfile1> <inputfile2> ...

<col> NOW means the tab-delimited column holding the relevant LOCUSLINK identifier, so if the second column in your file has the locuslink identifiers, put 2.

This script outputs each row from all the input files, and adds a tab-separated column at the start of each line containing the gene symbol for that locuslink identifier. If the locuslink identifier has more than one number separated by semicolons, then the resultant symbols are also separated by semicolons. Official symbols are used preferentially, then alias and "preferred" symbols.

locus_to_name.pl --column <col> <inputfile1> <inputfile2> ...

<col> NOW means the tab-delimited column holding the relevant LOCUSLINK identifier, so if the second column in your file has the locuslink identifiers, put 2.

This script outputs each row from all the input files, and adds a tab-separated column at the start of each line containing the gene name for that locuslink identifier. If the locuslink identifier has more than one number separated by semicolons, then the resultant names are also separated by semicolons. Official names are used preferentially, then alias and "preferred" names.