FrereGenetics

Digesting specific codes from txt files


Maarten Witberg

Please note: my file is meant as a demo for studying; it is not a full-fledged solution, not by a long way. Good night...

FrereGenetics

I just changed the limiting sequence to a global because it is always the same. Now my only problem is that I made a new MI actual-number field and put it in the portal, and it doesn't display the actual numbers of all the found results.

FrereGenetics

BTW, you need to tell me what kind of whiskey/cognac/wine you drink, because you're getting the biggest bottle in the Netherlands.

Maarten Witberg
to global bc it is always the same,

 

I avoided the global in order to get the parsing calcs to store. Storing calcs will make the file perform better. Use the auto-enter from the global to set the field's value upon import.

 

 

you need to tell me what kind of whiskey

 

Ooh, I have very expensive taste in whiskey. :cool: Never mind, it's a two-way learning street, always.

FrereGenetics

Is there any way to get it to search the strings for the removable code from the right and not the left? It's accurate all the way down until you get to one T or so, and then it removes the first T, not the last.

 

Also, some of the sequences will match exactly into the catalog even with the primer on the end. Is there any way to match those and then remove them from any future consideration in the search?
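The right-anchored removal asked about here can be sketched in Python (the thread's actual tool is FileMaker, so this is only an illustration; the function name and the truncated-primer handling are my assumptions, not anything from the demo file):

```python
# Sketch of right-anchored matching: remove the LAST occurrence of the
# (possibly truncated) primer rather than the first, using str.rfind.
# All names here are illustrative, not from the actual demo file.

def strip_primer_from_right(sequence: str, primer: str) -> str:
    """Remove the rightmost occurrence of `primer`, or of its longest
    leading fragment, searching from the right end of `sequence`."""
    # Try the full primer first, then progressively shorter prefixes,
    # so a truncated cap like a lone "T" still anchors to the right end.
    for n in range(len(primer), 0, -1):
        fragment = primer[:n]
        pos = sequence.rfind(fragment)   # rightmost match, not leftmost
        if pos != -1:
            return sequence[:pos]
    return sequence                      # no primer remnant found
```

For example, `strip_primer_from_right("ATTGCT", "T")` removes the last T, not the first, which is the behavior asked about above.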

FrereGenetics

I experimented with -1 instead of 1 in the Position function, but I never got any change whatsoever.

 

I need to count how far into the catalog string the interesting string is... any ideas?

This will be essential when searching the chromosome, because it has millions of letters, and I'm not sure my computer can handle an expanded permutation of an input 1,000,000 letters long.
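Counting how far into the catalog a string sits is a plain substring search. A minimal Python sketch (names are illustrative, not from the demo file) that mirrors a 1-based, Position-style result:

```python
# Minimal sketch: report how many characters into a catalog string an
# interesting sequence starts (1-based, like FileMaker's Position()).

def depth_in_catalog(catalog: str, interesting: str) -> int:
    """Return the 1-based position of `interesting` in `catalog`,
    or 0 if it does not occur."""
    pos = catalog.find(interesting)      # 0-based, -1 if absent
    return pos + 1                       # maps -1 -> 0, hit -> 1-based

def all_depths(catalog: str, interesting: str) -> list[int]:
    """All 1-based start positions, since a strand can occur
    more than once per chromosome."""
    hits, start = [], 0
    while (pos := catalog.find(interesting, start)) != -1:
        hits.append(pos + 1)
        start = pos + 1                  # step past to allow overlaps
    return hits
```

So `depth_in_catalog("GGATCC", "ATC")` reports 3, i.e. "this strand appears at position 3".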

Maarten Witberg

Well, you started out with 33-character strings and now you're suddenly introducing million-character strings.

I think I am not going to give any more answers until you give me a more complete picture of what you're building.

 

Also, the problems you describe leave me with some guesswork as to what you're doing. I think you hit the limit of the parsing calc, but it might also be something else. You also need a CF (custom function) for the Interesting Sequence parse, one that can handle any-length limiting sequences. Do not try to build your solution on my demo; it is what it is, just a demo based on scant info.

 

I can only work from what you give me. I'll have to think about how to search a million-character string. You can index strings of up to 400 characters for a relational match in a multikey, so you'd obviously have a problem if the string to be parsed is longer than that; a wildcard search would be more appropriate then. But how to implement either completely depends on what you're trying to do. So far you're only giving me very tiny windows on what seems to me a very interesting database-building challenge, and I think I've hit the ceiling of what I can achieve in a forum thread. IOW, my advice to you is to hire someone to either give you advice on the thing you're building or to build it for you based on your specs.

FrereGenetics

Background story: we are a genetics lab at a major US university working on small non-coding regions of DNA that are linked to the second biggest form of mental retardation in the world, fragile X (Down's is first). This research will hopefully lead to an ability to completely prevent it within about ten or so years.

My story: I mostly do tech work and I have some experience in programming. They (the people running the experiment) asked me to sort 33-length strings, each containing a region of these non-coding strands and a primer cap (limiting sequence). I said sure, no problem. Then they came back and asked me to pull out the interesting sequences in front of the limiting sequence and tell them how deep into a larger catalog of sequences these strings appear. I said that would be a little harder but doable; the programs that do this at the moment are not specific enough for our purposes. After I (we) did that, they told me yesterday that the ultimate goal is to be able to find these non-coding strands in the mouse genome, which includes 19 chromosomes about 1,000,000 bps (base pairs A, T, G, C) long. I hadn't heard of this either, and obviously a permutation that explodes a sequence into thousands isn't practical for a string over 400 bps long anyway. So while this work is very important, this program is used to aid our data and boost our numbers; our research is not dependent on it, but it could help tremendously.

Maarten Witberg

I'll get back to you in about 24hrs.

Maarten Witberg
o find these non coding strands in the mouse genome which includes 19 chromosomes about 1,000,000 bps (base pairs A,T,G,C) long

 

So would these non-coding strands occur maybe once in every chromosome, or could they appear multiple times?

 

And IIUC this is the procedure:

 

1. You have sample data (a range of sequences in the millions, 33 base pairs each, one letter representing one base pair, right? Because C is always coupled with T, or was that A :). This is the concern of the original question: mark the limiting sequences and pick out the sequence just before each one.

 

2. You have a catalog of sequences (a range of sequences in the thousands; how many base pairs each?) that you wish to compare against the sample data: in how many bits of the catalog can the sequence be found (and how good is the actual match)?

 

3. The catalog of sequences consists of bits of the mouse genome (19 chromosomes): of the good(?) matches of the experiment sequences to the catalog, or of all matches found, where exactly do these lie on each of these 19 chromosomes?

 

Is this correct?

 

The problem is how to judge the quality of the limiting sequences found, the quality of the match in the catalog, and then probably the quality of the match in the genome, so you have a margin of error that increases with each step. The purpose of this number-crunching operation is to give supportive evidence of the location of the interesting sequences on the mouse chromosome.

 

So that would make three major problems for a database that performs this:

1. How to parse out: at the moment we have proof of principle that it can be done.

2. How to judge the quality: my vague understanding is that the longer the limiting string and the interesting string, the better the quality, but there are tiny faults in many samples. You can't hand-pick from millions of records, so you need some statistical work done, but I have no idea what kind.

3. How to perform a matching operation on 1,000,000-character sequences. To the best of my knowledge this can be done with a wildcard search, but I confess not to know how FileMaker performs under these circumstances. You need to experiment with this.

 

In my view the database that could do this is structurally not very complex; however, it would need a solid way of tracking which sample data has been imported. The actual problems are in the sheer numbers involved and in how solving it this way or that would affect performance.

And like I said, I think some kind of statistical procedure is needed, but whether inside or outside your database I do not know.

 

If all my assumptions are in the right direction then I think my hunch is still correct: you need a pro developer at your side.

FrereGenetics

1. Close. We want to match up these 33-character strings without removing anything first, then remove a letter of the primer and search, remove another letter and search, and so on. We need the number of bps into the chromosome (i.e. "this strand appears at 300,000"). This would be ideal and would replace a 3500-euro program from Malaysia that is not specific enough for this work. It would also be effective if we removed the primer (the limiting sequence) from the end of these small strings and then searched (far easier). If we could search with the whole string, remove one letter and search again, and upon finding a match remove the searched-for string from future searches, that would be ideal.

Yes, these strings can appear more than once per chromosome.

(And it's A to T and G to C. You can remember this because A & T are letters composed of straight lines, and C & G go together because they are round letters. Except in RNA, where all T's are replaced by U's, but that is an easy substitution that I've already done.)

 

2. These chromosomes have bps in the millions and can be found at http://www.ucsc.edu. The actual match is going to be identical, and doing it in the remove-letter-search, remove-letter-search format would be more effective than ANY current program that does this kind of sequencing (i.e. best in the world). They can unfortunately be found in any part of the chromosome, but this is how life-saving relationships are discovered... regions previously thought to be non-coding have been revealed to be lethal, etc., through these relationships.

 

3. The catalog of smaller sequences that I was given originally consists of parts of the chromosome that are frequently examined in this research. Using the entire chromosome would be ideal, but it's a headache just thinking about it.

 

...

 

1. Correct.

2. The longer the match the better, and I have absolutely no idea what kind of statistical work either; we are both in the same dark space on that one.

3. See number one above.

 

And yeah, a pro developer would be best, but this software isn't essential; it would just speed up our work tremendously, and budgeting is quite tight in the lab.
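The search-trim-search-and-remove procedure described in point 1 could be sketched roughly like this in Python (illustrative only; function and parameter names are my assumptions, and the real data would be millions of records rather than a list):

```python
# Hedged sketch of the procedure in point 1: search the catalog with the
# whole string, then trim one primer letter at a time from the right and
# search again; once a string matches, record its 1-based depth and drop
# it from all future searches. Names are illustrative.

def progressive_match(samples: list[str], catalog: str,
                      primer_len: int) -> dict[str, int]:
    """Map each sample to the 1-based position of its first match in
    `catalog`, trimming up to `primer_len` letters from the right."""
    matched: dict[str, int] = {}
    remaining = list(samples)
    for trim in range(primer_len + 1):        # trim 0 = whole string
        still_unmatched = []
        for s in remaining:
            probe = s[:len(s) - trim] if trim else s
            pos = catalog.find(probe)
            if pos != -1:
                matched[s] = pos + 1          # 1-based depth
            else:
                still_unmatched.append(s)     # keep for next trim round
        remaining = still_unmatched           # matched ones are removed
    return matched
```

Samples that match with the primer still attached are caught in the first pass and never searched again, which is the "remove from future consideration" behavior asked about earlier.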

Maarten Witberg

So are you wanting to look into the whole chromosome or just the non-coding parts? Will that limit the length of the strings to be searched? I am looking for a way to chop up the chromosomes.

If you want to do a quick search through the chromosome, and not test millions of sample sequences one by one in a find-and-discard-or-keep procedure (which is going to take years), you need this.

What if you chop it up anyhow into overlapping bits, say strings of 100-200 bps? So record 1 holds 1-200; record 2, 101-300; record 3, 201-400; and so on. For a chromosome of 1 million bps you'd map it into only about 10,000 records. Then do the compare-and-omit in one whack.

Just a wild thought; I'll think about it some more.
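The overlapping-chunk mapping described above can be sketched in Python (illustrative only, assuming 200-bp chunks stepping by 100 as in the record 1 = 1-200, record 2 = 101-300 example):

```python
# Sketch of the chopping idea: map a chromosome string into overlapping
# 200-bp chunks stepping by 100, so a match that straddles one chunk
# boundary is wholly contained in the next chunk. Illustrative only.

def chop(chromosome: str, size: int = 200, step: int = 100):
    """Yield (1-based start, chunk) pairs covering the whole string."""
    for start in range(0, len(chromosome), step):
        yield start + 1, chromosome[start:start + size]
        if start + size >= len(chromosome):
            break                      # last chunk reached the end
```

A 1,000,000-bp chromosome maps into roughly 10,000 such records, and a global match position is then the chunk's start plus the position within the chunk, minus 1.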

 

BTW, if you're on a tight budget, declare this an open-source project and be prepared to share your solution. If you really think you can beat expensive existing software, you're on to something.

FrereGenetics

There are repetitive regions that could be chopped out, but it would be beneficial to search those too. Luckily they are in lowercase letters in these massive text files and the important ones are in uppercase, so I'm pretty sure upper- vs. lowercase can take care of those.

But that chopping idea sounds potent; that might be a good angle of attack.

Maarten Witberg

I did a quick test: a 200-character sequence parses into 20,100 lines of multikey. The overlap would take care of matches that fall in the starts and ends of each piece of chromosome. If you're searching for strings of up to 33 chars (but usually in the 10-20 range), you'd have more than enough overlap.
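As a sanity check on that 20,100 figure: a 200-character chunk has n*(n+1)/2 = 200*201/2 = 20,100 contiguous substrings, which is presumably what the multikey parse expands into. A Python sketch (illustrative; the real parse is a FileMaker custom function):

```python
# Every contiguous substring of a chunk, one per multikey line; a
# 200-char chunk yields n*(n+1)/2 = 20,100 of them, matching the
# count reported from the quick test.

def multikey_lines(chunk: str) -> list[str]:
    """All contiguous substrings of `chunk`, one per multikey line."""
    n = len(chunk)
    return [chunk[i:j] for i in range(n) for j in range(i + 1, n + 1)]
```

The quadratic growth here is also why the index bloats so quickly as chunk size increases.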

I don't expect this to be a fast solution. If it's too slow, you could take the chopping a step further and assign two or more machines to the task, each taking care of a portion of the chromosome, if you've got them sitting around. But computers are cheaper than consultants.

Maarten Witberg

If you're still interested... I've worked up a further trial.

 

1. All functions to find candidate coding sequences and cap sequences (I hope I get the names right) are now custom functions, which I assume makes them both faster and more robust.

 

2. Matching chunks of the chromosome to the candidate codons. I created a piece of bogus chromosome (something like a few thousand bps). This I mapped into chunks of 200 bps, with overlaps of 100 on either side. The multiline parse, as above, I matched to the candidates. The bogus chromosome resulted in very many hits (20-40 per chunk), which is probably not realistic.

But here's the deal: I had a mere 30 records for the chromosome map. The file already bloated to 167 MB (indexing the parse), and merely switching from record to record takes 10+ seconds.

So I had a second go, this time with 100-bps map chunks. The speed improved very much (1-4 secs for flipping through records), but the file size is still 117 MB for 62 map records.

 

So, as I suspected, speed optimization is definitely going to be an issue here, but I also need a more realistic chromosome sequence for testing.

If you're still interested, send me an email to witbe001-AT-planet.nl

Maarten Witberg

I'm very curious to know how you're doing on this project.

FrereGenetics

It's been back-burnered for two weeks, but I'll tell you more on Monday.

Maarten Witberg

okay thanks

FrereGenetics

So basically what's happened is: we received the latest batch of results, and it's a text document of around 2 gigabytes, which needless to say is enormous, like 650,000 lines. So the higher-ups decided to move to a more powerful tool like Perl, which is confusing the hell out of me, but another program, Parse-O-Matic, may fit the bill for what we're looking for.

Maarten Witberg

Thanks for the update. I have no knowledge of Perl, so I can't say if this is a good move or not. I do, however, still have that demo file sitting around doing nothing. You could decide to test its worth.

All you need for it is processor time, which should be cheap. Cheaper than a Perl developer anyhow :)
