0

I am writing code to retrieve specific characters from a text file by position. For example, I want the sequence of characters between positions 1043 and 1049 of a text like:

.........acddex.............

...and so on. I want that "acddex" sequence out of the text. I know its order and position. So far I can only open the file and input the position I want, but I have no idea how to number the characters of the whole text. Harder yet, the file is a combination of samples, so I also have to restart the character count at each ">" character. It looks like:

agoejngodgfjnsodjnfvsojdnvodfjnodjnfbodjngodjgndojgndlkfnvldfkngldjnfgdfjgnldjfngldjfngldfjngldjfngldjnfg dkjdnfgkjdnfgkjndfkgjndfjgnojfgnlfjngdljfngldjfng kdfjngkdfjngkjdndksjngskfjgndkfjgn

So I need the sequences out of these samples, which are all in the same file, and I know where the needed sequences start. How can I do this?

Note: It is not a short sequence, at around 200,000 chars, and I want it to report the characters between the 1046th and 1052nd positions, for example.

skrrgwasme
Lindy
  • Are you looking for syntax like `"abcdefg"[2:4]` or do you want to know how to read a file? – Matthias Jun 28 '16 at 21:01
  • Is the file small enough that you could just read it into a character string? Then you could just address the string slices you want. – Prune Jun 28 '16 at 21:02
  • The thing is, there are around 200,000 characters in my data, and what I want from the code is, for example: print the 1046th to 1052nd characters. – Lindy Jun 28 '16 at 21:09

2 Answers

1

Seek to the byte position of the start of the sequence you want, then call read and tell it how many bytes you want.

Example:

starting_position = 1043  # example value: zero-based byte offset where
                          # your desired string starts
read_length = 6           # example value: how many characters to read

# Open in binary mode so that seek() offsets are plain byte counts
with open("filename.txt", "rb") as f:
    f.seek(starting_position)
    st = f.read(read_length).decode("ascii")

# st now has your characters

Note: this answer assumes the file is either ASCII encoded, or uses some other encoding where each character is only one byte in the file.

If you're extracting a lot of sequences, try to get them in sequential order before you start seeking, so that you're not jumping around the file. After you get it working, consider profiling your code using mmap on the file instead of a normal open. You may see some speedup. (But as with all optimization - make sure you profile first and see if this section of your code really is the part that needs optimizing!)
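As a rough sketch of the `mmap` variant suggested above (not a measured improvement): the helper name `read_spans` and its `(start, length)` tuples are made up for illustration, and it assumes a non-empty, ASCII-encoded file with zero-based byte offsets. The spans are sorted so the file is read front to back.

```python
import mmap


def read_spans(path, spans):
    """Return {(start, length): text} for zero-based byte offsets.

    Hypothetical helper: assumes a non-empty ASCII file.
    """
    results = {}
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Visit the spans in ascending order of start offset.
            for start, length in sorted(spans):
                chunk = mm[start:start + length]  # slicing yields bytes
                results[(start, length)] = chunk.decode("ascii")
    return results
```

Whether this is faster than `seek` and `read` depends on the file and the access pattern, so it is worth profiling both.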

Community
skrrgwasme
  • Thank you, this was really helpful. What if my positions start after a specific character such as a line break? For example, in ">zethga00778ndskasdhgfb5677 aakhsbfkajef12938mas987124mn......" the count actually starts from aakhsb, I mean right after a break, and after each break the count resets. There is more than one place I want to report, so I need the code to interactively ask me for the positions right after each break in the same file. Thank you. – Lindy Jun 28 '16 at 21:38
  • @Lindy Then you need to put the file reading code in a loop with a prompt for user input (use `input` if you're using Python 3.x, and `raw_input` if Python 2.x), and convert the input into an integer to use as arguments to `seek` and `read`. The line breaks are the same as any other character - one byte per line break, unless you're on Windows, where each line break is two characters. – skrrgwasme Jun 28 '16 at 21:49
  • like this? 'read_length = 6 q = 0 with open("filename.txt") as f: f.seek(#line break) if q < 35 raw_input (starting_position:) f.seek(starting_position) st = f.read(read_length) print(st) q = q+1 f.seek(#next line break#) else print(Done!)' – Lindy Jun 28 '16 at 22:00
  • @Lindy You tell me. Did it work the way you intended? If not, open a new question about it. – skrrgwasme Jun 28 '16 at 22:02
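For the follow-up about counts that restart after each ">" line, one possible approach is to split the text into samples first and slice afterwards. This is only a sketch under assumptions the thread does not confirm: each sample starts with a ">" header that ends at the first line break, positions are 1-based and inclusive, and line breaks inside a sequence are not counted. The names `split_records` and `extract` are hypothetical.

```python
def split_records(text):
    """Map each '>' header to its sequence, with whitespace removed.

    Assumes the header runs from '>' to the first line break.
    """
    records = {}
    for chunk in text.split(">")[1:]:  # anything before the first '>' is ignored
        header, _, body = chunk.partition("\n")
        # Drop line breaks so positions count sequence characters only.
        records[header.strip()] = "".join(body.split())
    return records


def extract(records, header, start, end):
    """Return the characters at 1-based, inclusive positions start..end."""
    return records[header][start - 1:end]
```

An interactive version could call `extract` inside a loop that reads the header and positions with `input` and converts the positions with `int`.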
0
stuff = "agoejngodgfjnsodjnfvsojdnvodfjnodjnfbodjngodjgndojgndlkfnvldfkngldjnfgdfjgnldjfn"

print(stuff[10:20])

This will print the characters at indices 10 through 19: Python indices start at 0, and the end of a slice is exclusive.

So, if you want 1043-1049 and `stuff` holds your whole text:

print(stuff[1043:1049])
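If the positions are meant the way people usually count (the first character is position 1, and both ends are included), the slice bounds need a small adjustment. A minimal sketch, assuming that convention; `chars_between` is a hypothetical helper name:

```python
def chars_between(text, first, last):
    """Return the characters at 1-based positions first..last, inclusive."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # Shift the start down by one; the exclusive slice end already
    # lines up with the inclusive 1-based last position.
    return text[first - 1:last]


stuff = "agoejngodgfjnsodjnfvsojdnvodfjnodjnfbodjngodjgndojgndlkfnvldfkngldjnfgdfjgnldjfn"
print(chars_between(stuff, 1, 6))  # agoejn
```

With the whole file read into one string, `chars_between(text, 1046, 1052)` would then return the seven characters at positions 1046 through 1052.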
Chris