I am learning Python, and am trying to learn data.split(). I found the following in another StackOverflow question (link here), discussing appending a file in Python.

I have created biki.txt per the above link. Here's my code:

import re
import os
import sys 
with open("biki.txt","r") as myfile:
    mydata = myfile.read()
    data = mydata.replace("http","%http")
    for m in range (1,1000):
        dat1 = data.split("%")[m]
        f = open ("new.txt", "a")
        f.write(dat1)
        f.close()

But when I run the above, I get the error:

dat1 = data.split("%")[m]
IndexError: list index out of range

How come? I can't find documentation on what that [m] does, but removing it doesn't fix the issue. (If I remove [m], the error changes and says that the argument to f.write(dat1) must be a string or read-only character buffer (?).)

Thank you for any help or ideas!

user3718365

2 Answers

You should just iterate over data.split():

    for dat1 in data.split("%"):

Now you only split once (rather than on every iteration), the resulting list doesn't have to contain 1000 or more items (which was the cause of the IndexError), and each dat1 is a string rather than a list, which is what f.write() expects (the source of the other error).
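
As a minimal sketch of what that loop looks like in full, assuming a made-up sample string in place of the contents of biki.txt (the URLs are hypothetical):

```python
# Hypothetical sample standing in for the contents of biki.txt
mydata = "http://a.example/onehttp://b.example/two"

data = mydata.replace("http", "%http")
pieces = data.split("%")  # split once, before the loop

for dat1 in pieces:
    # dat1 is a plain string on every pass, so it could be handed to f.write()
    print(repr(dat1))
```

The loop runs exactly once per piece, however many pieces there are, so no index can run past the end of the list.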

jonrsharpe
  • When you say 'iterate over data.split()', what do you mean? (Again, I'm new to python, and I do have experience with VBA, I am still learning the vocab for Python). I have replaced the above code with the following, and it runs without error, creates the "new.txt", BUT doesn't split the URLs into new lines: with open("biki.txt","r") as myfile: mydata = myfile.read() data = mydata.replace("http","%http") for m in range (1,5): #dat1 = data.split("%")[m] for dat1 in data.split("%"): f = open ("new.txt", "a") f.write(dat1) f.close() – user3718365 Jun 07 '14 at 18:56
  • @user3718365 Code is very difficult to read in comments, particularly in python (where white space is important). I suggest you work through a tutorial, e.g. [the official one](https://docs.python.org/2/tutorial/), which will introduce these concepts; SO is not an appropriate place to cover such basic material. – jonrsharpe Jun 07 '14 at 20:43

First, you need to understand what is happening with m in your code. Consider:

for m in range(1,1000):
    print(m)

On the first iteration, the value of m is 1.

On each following iteration, m increases by 1: if m was 1 on the previous iteration, it is 2 on this one. The loop ends after m reaches 999, because range(1,1000) does not include 1000.

Second, you need to understand that the expression data.split('%') will split a string where it finds a '%' character, returning a list.

For example, assuming:

data = "one%two%three%four%five"
numbers = data.split('%')

numbers will be a list with five elements like this:

numbers = ['one','two','three','four','five']

To get an individual element of a list, you subscript the list, which means putting an index number inside square brackets [] (you can actually do a lot more with them, such as slicing):

numbers[0] # will return 'one'
numbers[1] # will return 'two'
...
numbers[4] # will return 'five'

Note that the first element of a list has index 0.

The list numbers has 5 elements and indexing starts at 0, so the last element has index 4. If you try to subscript with an index higher than 4, the Python interpreter will raise an IndexError, since there is no element at that index.

Your code is generating a list with fewer elements than the range you created, so the valid list indexes run out before the for loop is done. For example, if the list returned by data.split("%") has 500 elements, its valid indexes are 0 to 499, and an IndexError is raised as soon as m reaches 500.
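
A small sketch of that failure, reusing the five-element example from above rather than your real file:

```python
numbers = "one%two%three%four%five".split("%")

print(len(numbers))  # 5 elements, so the valid indexes are 0 through 4

try:
    numbers[5]  # one past the last valid index
except IndexError as error:
    print("IndexError: %s" % error)

# Looping over the list itself can never run past the end
for index, item in enumerate(numbers):
    print("%d %s" % (index, item))
```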

If I got what you want to do, you may achieve your objective with this code:

with open("input.txt","r") as file_input:
    raw_text = file_input.read()

formated_text = raw_text.replace("http","%http")
data_list = formated_text.split("%")

with open("output.txt","w") as file_output:
    for data in data_list:
        file_output.write(data+'\n') # writing one URL per line ;)
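
One caveat: if the input starts with "http", the first element produced by split("%") is an empty string, so the output file begins with a blank line. A hedged sketch of one way around that, assuming the input is just URLs written back to back (the sample text and the names pieces and urls are made up for illustration):

```python
# Hypothetical input: URLs written back to back with no separator
raw_text = "http://a.example/onehttp://b.example/two"

pieces = raw_text.replace("http", "%http").split("%")

# Keep only the non-empty pieces; the leading '' is the text before the first '%'
urls = [piece for piece in pieces if piece]

# "output.txt" follows the file name used above
with open("output.txt", "w") as file_output:
    for url in urls:
        file_output.write(url + "\n")
```

Filtering on truthiness drops only empty strings; if the file could contain other text before the first URL, testing for 'http' in each piece would be the stricter choice.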
Pablo
  • The code you posted worked! Thanks for your detailed explanation! Quick question - when I create "output.txt", the top line is blank (I assume the code adds a new line /n before looking at URLs in the input.txt) - how can I make sure the output.txt has a URL as the top line, not a blank line, then the four URLs? (Meta question: how do I reply on StackOverflow as a normal comment, not a reply to yours, where my response is small and unformatted? like, how did you add a "new" response?) – user3718365 Jun 07 '14 at 20:36
  • @user3718365 there is no functionality to respond other than in comments (which do `support` *some* **formatting**). It is not appropriate to 'comment' by creating a new answer. – jonrsharpe Jun 07 '14 at 20:40
  • Consider the string '%http'. Try '%http'.split('%') and look at the result: it is a list of two elements, ['','http']. Since I don't know what your input file looks like, a simple way to write only the elements that contain 'http' is to add an if statement inside the for loop (if 'http' in data: file_output.write...). Another way is a [list comprehension](https://docs.python.org/3.4/tutorial/datastructures.html#list-comprehensions): formated_data = [data for data in raw_text.replace("http","%http").split("%") if 'http' in data] (put this all on one line). – Pablo Jun 08 '14 at 11:15
  • Thanks for that Pablo, I'll try that out! FYI my file is simply four or five URLS, back to back (http://www.youtube.comhttp://www.youtube.com/watch?v=CDddd28http://...etc) and I just want to split them on to new lines. – user3718365 Jun 09 '14 at 22:12