FrereGenetics

Digesting specific codes from txt files


FrereGenetics

So here's my problem:

 

Our program spits out 60,000 lines of information at a time.

The information we need is laid out like this:

622186 TGAGGTAGTAGGTTGTATGGTTTCGTATGCCGT

160876 TGAGGTAGTAGATTGTATAGTTTCGTATGCCGT

118615 TGAGGTAGTAGGTTGTGTGGTTTCGTATGCCGT

115231 TCTACAGTCCGACGATCTCGTATGCCGTCTTCT

95367 TTCTACAGTCCGACGATCTCGTATGCCGTCTTC

83418 TGAGGTAGTAGGTTGTATAGTTTCGTATGCCGT

71572 CTACAGTCCGACGATCTCGTATGCCGTCTTCTG

The string of letters (T, G, C, and A) that we need comes before a certain limiting string that has anywhere from 3 to 21 letters. (For reference, it always starts TCGTATGCC..., and it is always the same sequence, up to 21 letters of it.)

Easy part: I need a way to enter the number on the left as a serial ID for each record in FMP, together with the 18-31 letter sequence that comes after the space but BEFORE the 3-21 letter sequence.

Because the limiting sequence is always the same, I believe FMP can do this with the Left function.

Also, the total letter sequence is always 33 letters long, so the length of the limiting sequence on the end varies.

Here's the tricky part: I need to filter out the lines I have already extracted from the txt file, because the ones to the left of all 21 letters are the most valuable, then 20, then 19, and so on.

I think this can be done with 21 separate scripts that each look for a line with one less letter from the sequence, down to a 3 letter sequence (so actually 19 scripts, but you get the picture).

Does anyone know the Left expression that would do this? (I could probably figure this part out myself, but if someone knows it off the top of their head, that would be easier.)

And more importantly, does anyone know how to ignore the lines that were already handled with a longer limiting sequence? (So that when I get down to, say, looking to the left of a 3 letter sequence, it doesn't include the previously searched strands, which have a high likelihood of containing that specific three letter strand more than once.)

AHunter3

I can do text-parsing practically in my sleep. I'm having more problems parsing your post (sorry).

 

You've got this:

622186 TGAGGTAGTAGGTTGTATGGTTTCGTATGCCGT

160876 TGAGGTAGTAGATTGTATAGTTTCGTATGCCGT

118615 TGAGGTAGTAGGTTGTGTGGTTTCGTATGCCGT

115231 TCTACAGTCCGACGATCTCGTATGCCGTCTTCT

95367 TTCTACAGTCCGACGATCTCGTATGCCGTCTTC

83418 TGAGGTAGTAGGTTGTATAGTTTCGTATGCCGT

71572 CTACAGTCCGACGATCTCGTATGCCGTCTTCTG

 

You want WHAT?

 

Is the above the information that will feed a single record, or will it be parsed into several FileMaker records?

 

Can you say this part again as clearly as possible?:

and the 18-31 letter long sequence that comes after the space but BEFORE the 3-21 letter sequence

 

I don't see any place in your sample data where there is a sequence lying between a space and a 3-21 letter sequence that follows it. :confused:

 

If you would give some specific examples of what your desired result would look like, that might clarify matters.

 

 

I can almost guarantee you that you will not need 21 scripts, or a single script with 21 sub-parts, but more likely a Loop / End Loop that walks through your input and parses different outputs from it until done.

 

I strongly suspect you want to be using Middle () not Left () for at least some of this (you've got values BETWEEN other values not merely to the LEFT of a cutoff point, yes?).

 

I am on very good terms with Middle, Position, and PatternCount and I can probably help you as soon as I can figure out what it is that you're trying to do here.

FrereGenetics

Okay, so what I've got so far is one record per sequence.

I pulled the number out of the line as the ID number, and I am trying to pull out the text sequence, but setfield (uglytext, getastext (uglytext)) isn't working and neither is setfield (uglytext, upper (uglytext)). They both return the sequence with the numbers and the letters; I just want the letters.

 

So each of our desired sequences ends with this sequence: TCGTATGCCGTCTTCTGCTTGT. However, because each line can only output 33 characters, the end of that limiting sequence gets chopped off. The desired sequence before those letters also varies in length.

 

For example:

118615 TGAGGTAGTAGGTTGTGTGGTTTCGTATGCCGT

115231 TCTACAGTCCGACGATCTCGTATGCCGTCTTCT

In these two we want to cut out the parts in parentheses:

118615 TGAGGTAGTAGGTTGTGTGGTTTCG(TATGCCGT)

115231 TCTACAGTCCGACGATCTCG(TATGCCGTCTTCT)

and keep the sequences that come before them.

 

Because the desired sequence can be anywhere from 18-30 letters long, the end sequence can be as short as two letters and as long as 21. BUT the problem with filtering out, say, only everything to the left of the first 6 letters of the sequence is that it's less accurate than using 10 or 15 or 21. So if I could organize it so that the ones that were filtered out with the largest portion of the limiting sequence came first, that would greatly improve our accuracy.

If we were just filtering out three letters at a time, there is a large chance that the three letter sequence would appear in our desired strain as well. So if we could eliminate from the search the lines that have already been filtered, so that they retain their superior accuracy, that would be sublime.

Does that make more sense?

Norma_Snockurs
"I can probably help you as soon as I can figure out what it is that you're trying to do here."

Good luck with that, AH! ;-) I suspect the method to achieve the desired result will be easy. However, to get to that requires understanding what the hell this poster is talking about. I have read the posts several times but clarity is definitely not forthcoming. Perhaps a step by step breakdown in a list format of what goes in and what comes out might help us get our head around this problem - it's making no sense to me at the moment.

Maarten Witberg

Well, here's a first stab. What I understand is that 60,000 records come with sequences of bits of DNA. The bits are 33 characters long. Some sequences are interesting, and they end with a marker string. However, the marker string may only appear in part. If it is an interesting sequence, the marker must be chopped off. If not, it is to be ignored or discarded.

Testing and parsing could be done best using one or two custom functions, especially if there are marker strings of various length.

But the OP has not got the developer edition, so here's how I'd do it for a string that is max. 13 characters long. It would occur to me that the shorter the marker sequence is made, the less precision (i.e. a higher chance of a false positive) is obtained. Nevertheless, here's something to play with. If you need a better approach, post more data with and without (parts of) various markers. I guess you're going to be better off with a custom function.

 

Left 
(
Sequence ; 
Case 
( 
  Position ( Sequence ; gMarker ; 1 ; 1 ) ;  Position ( Sequence ; gMarker ; 1 ; 1 )  ; 
  Position ( Sequence ; Left (gMarker; Length(gMarker) -1  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -1  ); 1 ; 1 )  ;
  Position ( Sequence ; Left (gMarker; Length(gMarker) -2  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -2  ) ; 1 ; 1 ) ;
  Position ( Sequence ; Left (gMarker; Length(gMarker) -3  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -3  ) ; 1 ; 1 ) ;
  Position ( Sequence ; Left (gMarker; Length(gMarker) -4  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -4  ) ; 1 ; 1 ) ;
  Position ( Sequence ; Left (gMarker; Length(gMarker) -5  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -5  ) ; 1 ; 1 ) ;
  Position ( Sequence ; Left (gMarker; Length(gMarker) -6  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -6  ) ; 1 ; 1 ) ;
  Position ( Sequence ; Left (gMarker; Length(gMarker) -7  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -7  ) ; 1 ; 1 ) ;
  Position ( Sequence ; Left (gMarker; Length(gMarker) -8  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -8  ) ; 1 ; 1 ) ;
  Position ( Sequence ; Left (gMarker; Length(gMarker) -9  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -9  ) ; 1 ; 1 ) ;
  Position ( Sequence ; Left (gMarker; Length(gMarker) -10  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -10  ) ; 1 ; 1 ) ;
  Position ( Sequence ; Left (gMarker; Length(gMarker) -11  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -11  ) ; 1 ; 1 ) ;
  Position ( Sequence ; Left (gMarker; Length(gMarker) -12  ) ; 1 ; 1 ) ; Position ( Sequence ; Left (gMarker; Length(gMarker) -12  ) ; 1 ; 1 ) 
  // and so on
)
- 1
)

I am assuming the serial number is already removed using Sequence = RightWords (InputString ; 1)
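For readers who want to check the logic of the Case cascade above outside FileMaker, here is a minimal Python sketch of the same idea. It is an illustration, not the calc itself. The function names and the `shortest=3` cutoff are assumptions made for this example, and the marker is the limiting sequence as the OP quotes it elsewhere in this thread.

```python
MARKER = "TCGTATGCCGTCTTCTGCTTGT"  # limiting sequence as quoted in this thread


def split_line(line):
    """Split '622186 TGAGG...' into (id, sequence), like RightWords(InputString; 1)."""
    read_id, sequence = line.split()
    return int(read_id), sequence


def trim_before_marker(sequence, marker=MARKER, shortest=3):
    """Return the text left of the longest marker prefix found, or '' if none.

    Mirrors the Case() cascade: try the full marker first, then drop one
    trailing letter at a time, stopping at the first prefix that occurs.
    The shortest=3 cutoff is an assumption for this sketch.
    """
    for length in range(len(marker), shortest - 1, -1):
        position = sequence.find(marker[:length])  # 0-based, -1 if absent
        if position != -1:
            return sequence[:position]
    return ""
```

Because the longest prefix is tried first, each line is handled exactly once at its best match, so there is no need to exclude "already extracted" lines in later passes. For example, `trim_before_marker("TCTACAGTCCGACGATCTCGTATGCCGTCTTCT")` returns `"TCTACAGTCCGACGATC"`.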

FrereGenetics

Excellent!

That's it!

But the extracted code that comes before a longer stretch of the limiting code (i.e. it has 10 matching characters rather than only 3) is much more desirable. Is there any way that, after each filtration (21, 20, 19), the results could be put into their own separate databases (or records)?
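One way to picture the "separate pile per filtration" request is to record how many marker letters each line matched and group on that number. The Python sketch below only illustrates that idea. The function names and the `shortest=3` cutoff are assumptions. In FileMaker the equivalent would presumably be a number field holding the match length, which can then be used for finds or sorting.

```python
from collections import defaultdict

MARKER = "TCGTATGCCGTCTTCTGCTTGT"  # limiting sequence as quoted in this thread


def longest_match_length(sequence, marker=MARKER, shortest=3):
    """Length of the longest marker prefix that occurs in sequence, else 0."""
    for length in range(len(marker), shortest - 1, -1):
        if marker[:length] in sequence:
            return length
    return 0


def bucket_by_match_length(sequences):
    """Group sequences by how much of the marker each one matched."""
    buckets = defaultdict(list)
    for sequence in sequences:
        buckets[longest_match_length(sequence)].append(sequence)
    return dict(buckets)
```

Each sequence lands in exactly one bucket, keyed by its best match, so the high-confidence groups stay separate from the short, less reliable ones.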

FrereGenetics

Yeah, I already extracted the numbers and letters using Filter and GetAsNumber.

If you want more data, I have millions:

 

622186 TGAGGTAGTAGGTTGTATGGTTTCGTATGCCGT

160876 TGAGGTAGTAGATTGTATAGTTTCGTATGCCGT

118615 TGAGGTAGTAGGTTGTGTGGTTTCGTATGCCGT

115231 TCTACAGTCCGACGATCTCGTATGCCGTCTTCT

95367 TTCTACAGTCCGACGATCTCGTATGCCGTCTTC

83418 TGAGGTAGTAGGTTGTATAGTTTCGTATGCCGT

71572 CTACAGTCCGACGATCTCGTATGCCGTCTTCTG

55939 TGAGGTAGTAGGTTGTATGGTTTCGTATTCCGT

50625 TGAGGTAGTAGGTTGTATGGTTCGTATGCCGTC

37978 TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAA

30660 TGAGGTAGGAGGTTGTATAGTTTCGTATGCCGT

27953 TGAGGTAGTAGGTTGTATGGTTTCGTATGTCGT

27722 AGAGGTAGTAGGTTGCATAGTTTCGTATGCCGT

27609 AGTTCTACAGTCCGACGATCTCGTATGCCGTCT

25457 TGAGGTAGTAGGTTGTATGGTTTTCGTATGCCG

23354 TGAGGTAGTAGATTGTATAGTTTCGTATTCCGT

19988 TGAGGTAGTAGATTGTATAGTTCGTATGCCGTC

17959 TGAGGTAGTAGGTTGTGTGGTTTTCGTATGCCG

17637 TACAGTCCGACGATCTCGTATGCCGTCTTCTGC

17104 TGAGGTAGTAGGTTGTGTGGTTCGTATGCCGTC

15165 TGAGGTAGTAGGTTGTATGGTTATCGTATGCCG

13406 TGAGGTAGTAGTTTGTACAGTTTCGTATGCCGT

11748 TGAGGTAGTAGGTTGTATAGTTCGTATGCCGTC

10764 GTTCTACAGTCCGACGATCTCGTATGCCGTCTT

9863 TGAGGTAGTAGGTTGTGTGGTTTCGTATTCCGT

9605 GAGGTAGTAGGTTGTATGGTTTCGTATGCCGTC

9294 TGAGGTAGTAGGTTGTATGGTTTCGTATGCCTT

8925 TGAGGTAGTAGGTTGTATGGTTTCGTATGCTGT

8711 ACAGTCCGACGATCTCGTATGCCGTCTTCTGCT

8487 CGCGACCTCAGATCAGACTCGTATGCCGTCTTC

8484 TGAGGTAGTAGGTTGTATAGTTTCGTATTCCGT

8224 TCAGAGTTCTACAGTCCGACGATCTCGTATGCC

7576 TGAGGTAGTAGGTTGTATGGTTTCTTATGCCGT

7134 TGAGGTAGTAGGTTGTATGGTCGTATGCCGTCT

7009 AGCAGCATTGTACAGGGCTATTCGTATGCCGTC

6999 CGCGACCTCAGATCAGACGTCGTATGCCGTCTT

6897 TGAGGTAGTAGGTTGTGTGGTTTCGTATGTCGT

6541 TGAGGTAGTAGGTTGTATGGTTCGTATGCCGTT

6119 TGAGGTAGTAGGTTGTGTGGTTATCGTATGCCG

5911 TGAGGTAGTAGGTTGTGTGGTCGTATGCCGTCT

5828 TGAGGTAGTAGATTGTATAGTTTCGTATGTCGT

5621 CTACAGTCCGACGATCTCGTATGCCGTCTTCTT

5601 AGCAGCATTGTACAGGGCTATGATCGTATGCCG

5308 TGAGGTAGGAGGTTGTATAGTTCGTATGCCGTC

5134 TCCCTGAGACCCTAACTTGTGATCGTATGCCGT

4944 TGAGGTAGTAGATTGTATAGTTATCGTATGCCG

4558 GAGTTCTACAGTCCGACGATCTCGTATGCCGTC

4395 TGAGGTAGGAGGTTGTATAGTTTTCGTATGCCG

4277 TGAGGTAGTAGGTTGTATAGTTTTCGTATGCCG

4275 CGCGACCTCAGATCAGACGTGTCGTATGCCGTC

4229 TGAGGTAGTAGATTGTATAGTTTTCGTATGCCG

4025 TTCTACAGTCCGACGATCTCGTATGCCGTCTTT

3958 TGAGGTAGTAGGTTGTATGGTTTCGTATGGCGT

3713 AAAAGCTGGGTTGAGAGGGCGATCGTATGCCGT

3581 CGCGACCTCAGATCAGACGTTCGTATGCCGTCT

3490 TGAGGTAGTAGGTTGTATGGTTTTCGTATTCCG

3249 TGAGGTAGGAGGTTGTATGGTTTCGTATGCCGT

3238 TATTGCACTCGTCCCGGCCTCCTCGTATGCCGT

3217 TGAGGTAGTAGGTTGTATAGTTATCGTATGCCG

3000 TGAGGTAGTAGGTTGTATGGTTTCGTTTGCCGT

2960 TGAGGTAGTAGGTTGTGTGGTTCGTATGCCGTT

2917 TGAGGTAGTAGATTGTATAGTTTCGTATGCCTT

2871 GACGATCTCGTATGCCGTCTTCTGCTTGAAAAA

2812 TGAGGTAGTAGATTGTATAGTCGTATGCCGTCT

2797 TCTTTGGTTATCTAGCTGTATGATCGTATGCCG

2782 TGAGGTAGTAGGTTGTATGGTTATCGTATTCCG

2732 CAGTCCGACGATCTCGTATGCCGTCTTCTGCTT

2729 TGAGGTAGTAGGTTGTATAGTTTCGTATGTCGT

2727 CAGAGTTCTACAGTCCGACGATCTCGTATGCCG

2672 AACCCGTAGATCCGATCTTGTGTCGTATGCCGT

2629 AAGGTAGATAGAACAGGTCTTGTCGTATGCCGT

2617 TTCAAGTAATCCAGGATAGGCTTCGTATGCCGT

2589 TGAGGTAGTAGGTTGTGTGGTTTCGTATGCTGT

2563 TGAGGTAGTAGGTTGTATGGTTTTGTATGCCGT

2555 TGAGGTAGTAGATTGTATAGTTTCTTATGCCGT

2533 TGAGGTAGTAGGTTGTGTGTCGTATGCCGTCTT

2468 GGCAGAGGAGGGCTGTTCTTTCGTATGCCGTCT

2467 TGAGGTAGGAGGTTGTATAGTTTCGTATTCCGT

2441 TGAGGTAGGAGGTTGTATAGTTATCGTATGCCG

2358 TCTTTGGTTATCTAGCTGTATGTCGTATGCCGT

2310 TGAGGTAGTAGGTTGTGTGGTTTTCGTATTCCG

2265 AGAGGTAGTAGGTTGCATAGTTCGTATGCCGTC

2134 GGATTTTTGGAAGTAGGAGTCGTATGCCGTCTT

2131 CGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA

2079 TGAGGTAGGAGGTTGTATAGTTGTCGTATGCCG

2062 TGGATTTTTGGAAGTAGGAGTCGTATGCCGTCT

2032 TCTTTGGTTATCTAGCTGTATTCGTATGCCGTC

1976 TGAGGTAGTAGGTTGTATGGTTTCGTATTCCTT

1974 TGAGGTAGTAGGTTGTATGGTATCGTATGCCGT

1959 TGAGGTAGTAGGTTGTGTGGTTTCGTATGCCTT

1945 AGAGTTCTACAGTCCGACGATCTCGTATGCCGT

1923 TGAGGTAGTAGGTTGTATGGTTATCGTATGTCG

1886 TGAGGTAGTAGGTTGTATGGTTTCTTATTCCGT

1862 AGCAGCATTGTACAGGGCTATGTCGTATGCCGT

1827 TCACAGTGAACCGGTCTCTTTTCGTATGCCGTC

1818 TGAGGTAGTAGATTGTATAGTTCGTATGCCGTT

1791 TGAGGTAGTAGGTTGTATGTTTTCGTATGCCGT

1787 TGAGGTAGTAGTTTGTGCTGTTTCGTATGCCGT

 

And the limiting sequence is TCGTATGCCGTCTTCTGCTTGT.

AHunter3
"So each of our desired sequences ends with this sequence TCGTATGCCGTCTTCTGCTTGT"

 

Sorry, you did say that, and I missed it.

 

So you want to parse out the piece of your initial text string that starts after a space (the first space, right?) and ends with TCGTATGCCGTCTTCTGCTTGT?

 

(This is DNA, isn't it?)

 

Set Variable [$EndSeq, "TCGTATGCCGTCTTCTGCTTGT"]

Comment ["That just saves us from having to type out that mess each time."]

Set Field [YourTable::DestinationField, Middle(UglyTextField, Position(UglyTextField, " ", 1, 1)+1, Position(UglyTextField, $EndSeq, 1, 1)+Length($EndSeq) - Position(UglyTextField, " ", 1, 1)-1)]

 

that ought to work.
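To make the arithmetic in that Middle/Position expression easier to follow, here is a small Python sketch of what it computes: everything after the first space, up to and including the end sequence. The function name is hypothetical, and the end sequence in the usage note is shortened to what fits in a 33-letter line. This is an assumption for illustration only.

```python
def from_space_through_marker(line, end_seq):
    """Return the text after the first space up to and including end_seq.

    Python's str.index is 0-based and raises ValueError when the text is
    missing, whereas FileMaker's Position is 1-based and returns 0.
    """
    start = line.index(" ") + 1                  # first letter after the space
    end = line.index(end_seq) + len(end_seq)     # one past the end of end_seq
    return line[start:end]
```

For instance, `from_space_through_marker("2871 GACGATCTCGTATGCCGTCTTCTGCTTGAAAAA", "TCGTATGCCGTCTTCTGCTTG")` returns `"GACGATCTCGTATGCCGTCTTCTGCTTG"`. Note that the result keeps the end sequence, whereas the OP wants only the letters in front of it.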

 

 

EDIT: well, that's what happens when you get a long phone call and don't refresh your screen before finishing your post. Ah well, good practice for me at any rate.

FrereGenetics

Set Variable [$EndSeq, "TCGTATGCCGTCTTCTGCTTGT"]

Comment ["That just saves us from having to type out that mess each time."]

Set Field [YourTable::DestinationField, Middle(UglyTextField, Position(UglyTextField, " ", 1, 1)+1, Position(UglyTextField, $EndSeq, 1, 1)+Length($EndSeq) - Position(UglyTextField, " ", 1, 1)-1)]

 

 

Yes, it's DNA, and I already extracted the letters only to their own field, so Middle and " " aren't necessary, but thank you.

As for the next step: setfield [destfield, left(extractfrommefield, ....?

And is there a way to prioritize my extracted fields so that the ones that use the longest limiting sequence come first?

Maarten Witberg
"and is there a way to prioritize my extracted fields so that the ones that use the longest limiting sequence come first"

 

If you're using my calc, sort by Length (ParsedString). You can omit the ones where length=0 first.

FrereGenetics

Thank you so much, that worked perfectly: 90,000 strings in a minute. That used to take us weeks.

Maarten Witberg

Are you sure? Please check the result!

Maarten Witberg

I'm pretty sure at least my last advice was erroneous. Please see the sample. Apart from going cross-eyed, I noticed that you can have partial marker strings in the middle of the sequence. So it is not necessarily true that the shortest parsed string has the best match. In fact it's quite silly to assume this. So here's a method that color-codes the marker string, or the part of the marker string that it finds a match for. I sorted by the length of the marker string found. The calcs are really ugly. See if this is better.

 

http://home.planet.nl/~witbe001/ParseSequences.fp7.zip

FrereGenetics

Everything is working great except the sort feature.

So how do I write the script to sort by Length(ParsedString)?

Maarten Witberg

Well my sample has a table view; just click on the column header you want to sort (or click again to reverse the sort).

Otherwise create a field

 

LengthString = Length(ParsedString)

 

Script:

Sort Records [ Restore ; No dialog ] # by LengthString

I would like to understand a little bit about what exactly you are doing, because the current calc is kind of tailor-made for the current limiting sequence. And it doggedly keeps going until it finds even one letter that matches, which IIUC is not useful.

 

So sorting by ParsedString currently leaves you with strings that have been found before a match of two or three characters from the limiting sequence. Isn't that a false positive?

 

(The matched part of the limiting sequence is shown in [brackets].)

no match:                          [T]GAGGTAGTAGGTTGTATGGTTTTGTATGCCGT
not good:    TGAGGTAGTAGGTTGTATGGTT[TCGT]TTGCCGT
pretty good: AACCCGTAGATCCGATCTTGTG[TCGTATGCCGT]
very good:                  GACGATC[TCGTATGCCGTCTTCTGCTTG]AAAAA
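The four cases above can be reproduced with a short Python sketch that reports where the longest matching piece of the limiting sequence starts and how long it is, then sorts by that length rather than by the length of the parsed string. The function names and the `min_match` parameter are assumptions introduced for this illustration. They are not part of the sample file.

```python
MARKER = "TCGTATGCCGTCTTCTGCTTGT"  # limiting sequence as quoted in this thread


def best_marker_match(sequence, marker=MARKER, min_match=1):
    """Return (position, matched_length) for the longest marker prefix found.

    position is 0-based; (-1, 0) means no prefix of at least min_match
    letters occurs anywhere in the sequence.
    """
    for length in range(len(marker), min_match - 1, -1):
        position = sequence.find(marker[:length])
        if position != -1:
            return position, length
    return -1, 0


def sort_by_match_quality(sequences, marker=MARKER, min_match=1):
    """Order sequences so the longest marker matches come first."""
    return sorted(
        sequences,
        key=lambda s: best_marker_match(s, marker, min_match)[1],
        reverse=True,
    )
```

With `min_match=1` the "no match" line still reports a one-letter hit at position 0, which is the dogged behavior described above. Raising `min_match` to 3, for example, turns that line into a clean `(-1, 0)`.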

Also, will you be using other limiting strings that are much longer or shorter? You have the raw data, which is 33 characters long, but is this always the case? Could it be 300 characters at some point?

I think you should have a custom function that can recursively handle limiting strings that are (much) longer or shorter, and you should be able to set the minimum number of matching characters accordingly. I could probably come up with one, but like I said, you have not got FileMaker Developer, so you can't put the custom function into your solution yourself. If you're interested, please post again.

FrereGenetics

Yes the output is always 33 letters long.

 

And while it is good to avoid false positives, sometimes the primer sequence mutates a tiny bit during the reaction that creates these, or sometimes the desired strain isn't produced at all. If those cases appeared in large numbers, it could tell us that we need to change our experiment.

Also, my scripts do not have the option sort [ no dialog ; restore ] # by LengthString. I created the field lengthstring, but in my scripts I can only sort by ascending order, etc.

FrereGenetics

And for our current experiment that is the only limiting sequence, but I do have it set up right now with an equation that would make it incredibly simple to change the limiting cap.

Maarten Witberg

You can specify the sort options when you select the Sort Records script step.

FrereGenetics

I hate to sound like a n00b here, but when I create a script to sort filtered string I see no options. All I see is sort by ascending order, etc.

Here's my script:

go to field [notes::filtered string]

sort records [restore; no dialog]

And when I click Specify on Sort Records, I select lengthstring, descending.

lengthstring field:

length string number auto-enter by calc (textlength=length(filtered string))

And still no sorting appears?

Maarten Witberg
"sort filtered string"

What do you mean?

FrereGenetics

That's the name of the output I want sorted: "filtered string". I'm having trouble getting the length feature to work.

But I have a new problem:

I need to match the original string and the extracted string against a new database and have FileMaker return which MI number it matched.

This text file is set up like this:

>mmu-let-7a-1 MI0000556

UUCACUGUGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAUACAAUCUACUGUCUUUCCUAAGGUGAU

>mmu-let-7a-2 MI0000557

CUGCAUGUUCCCAGGUUGAGGUAGUAGGUUGUAUAGUUUAGAGUUACAUCAAGGGAGAUAACUGUACAGCCUCCUAGCUUUCCUUGGGACUUGCAC

However, because there is a line break between the number and the string, this data imports into FileMaker as four records and not two. Is there any way to import alternating lines as different fields of the same record?

FrereGenetics

I don't know why there are spaces in my codes; there aren't any in my .txt.

Norma_Snockurs
"I don't know why there are spaces in my codes; there aren't any in my .txt."

 

There is a maximum word length limit of 50 characters in forum posts. Use the 'CODE' tags (click the Hash symbol when composing posts) to enclose your text and your strings will display as they should.

Maarten Witberg

I guess the simplest way is to prep the import file (I'm using the MS Word replace function):

 

1. replace ^p> with XXXXX

2. replace ^p with ^t

3. replace XXXXX with ^p>

 

Save as .txt; you now have a tab-delimited import file.
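The same prep can also be scripted instead of done by hand in Word. The Python sketch below is one hypothetical way to do it. The function names are made up for this example, and it assumes every header line looks like `>name accession` as in the data posted in this thread. Unlike the three replace steps, it also splits the name and the MI number into separate fields.

```python
def fasta_to_rows(text):
    """Turn '>name accession' / sequence line pairs into (name, accession, sequence)."""
    rows = []
    header = None
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        if line.startswith(">"):
            if header is not None:
                rows.append((*header, "".join(parts)))
            fields = line[1:].split()
            header, parts = (fields[0], fields[1]), []
        else:
            parts.append(line)  # a sequence may span several lines
    if header is not None:
        rows.append((*header, "".join(parts)))
    return rows


def rows_to_tab_delimited(rows):
    """One record per line, fields separated by tabs, ready for import."""
    return "\n".join("\t".join(row) for row in rows)
```

Writing the result of `rows_to_tab_delimited(fasta_to_rows(text))` to a .txt file gives one line per record with three tab-separated fields, which FileMaker can import as tab-delimited text.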

 

http://home.planet.nl/~witbe001/ParseSequencesSort.fp7.zip

 

Here's a modified sample that shows how to sort by various fields, ascending or descending. If this is not what you're after, then you need to be more precise in your description; a description of your database, or a sample of it, may help too. You are quite short in your descriptions, making us guess...

FrereGenetics

>mmu-let-7a-1 MI0000556
UUCACUGUGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAUACAAUCUACUGUCUUUCCUAAGGUGAU
>mmu-let-7a-2 MI0000557
CUGCAUGUUCCCAGGUUGAGGUAGUAGGUUGUAUAGUUUAGAGUUACAUCAAGGGAGAUAACUGUACAGCCUCCUAGCUUUCCUUGGGACUUGCAC
>mmu-let-7b MI0000558
GCAGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCCCCUCCGAAGAUAACUAUACAACCUACUGCCUUCCCUGA
>mmu-let-7c-1 MI0000559
UGUGUGCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCUGGGAGUUAACUGUACAACCUUCUAGCUUUCCUUGGAGCACACU

 

I'm not sure what I would find or replace in this dataset, because the MI number is associated with the following sequence and I don't know how to edit out every other "enter".

FrereGenetics

Is there a way to use the If command to move any record not containing numbers into a new field on the previous record?

Or to move every record containing numbers into a new field on the following record?

Maarten Witberg

Take a good look: between two intended records, there is a paragraph mark and a > sign. So replace these pairs with a placeholder such as XXXXX, and then you can replace the rest of the paragraph marks with a tab. Then put the paragraph mark and the > back in place of the XXXXX. I suggested as much in MS Word coding for tab (^t) and paragraph mark (^p).

FrereGenetics

OK, I took a different approach that leaves little room for user error in setting up this database: I created a loop that copies and pastes the next page into the last page's field.

However, my problem now is that I have a large strain of RNA and I need a strain of DNA. That is done by replacing all the "U"s with "T"s, and Replace is not really working with me right now, more so against me...

 

replace ("U", 1, 1, "T") erases the whole thing...

 

>>??

FrereGenetics

This could be easily done in a text editor or Word, but I'd prefer to eliminate any user prep whatsoever and have FMP do everything for them.

Maarten Witberg

Substitute ( Sequence ; "U" ; "T" )

 

genetic engineering made easy :cool:
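For comparison, the same U-to-T conversion is a one-liner in Python. The function name is hypothetical. As far as I know, FileMaker's Replace works by character position (text; start; numberOfCharacters; replacementText) rather than by search string, which would explain why the earlier attempt did not behave like a find-and-replace, whereas Substitute swaps every occurrence.

```python
def rna_to_dna(sequence):
    """Replace every U with T, like Substitute(Sequence; "U"; "T")."""
    return sequence.replace("U", "T")
```

For example, `rna_to_dna("UGAGGUAGUAGGUUGUAUAGUU")` returns `"TGAGGTAGTAGGTTGTATAGTT"`.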
