preprocessing (rstrip and regular expression and simpler code)

Question

I'm trying to read 200 txt files and do some preprocessing.

1) How can I write simpler code instead of repeating the same code for each of the txt files?

2) Can I combine a regular expression with rstrip?

-> Mainly, I want to get rid of "\n", but sometimes it is stuck to other characters. So what I want is to remove every \n as well as the words that are combined with \n (e.g. "\n?", "!\n", and so on).

3) In the last line, is there a simpler way to combine all the lists into one list?

data = open("job (0).txt", 'r').read()
rows0 = data.split(" ")
rows0 = [item.rstrip('\n?, \n') for item in rows0]

data = open("job (1).txt", 'r').read()
rows1 = data.split(" ")
rows1 = [item.rstrip('\n?, \n') for item in rows1]

.....(up to 200th file)

data = open("job (199).txt", 'r').read()
rows199 = data.split(" ")
rows199 = [item.rstrip('\n?, \n') for item in rows199]

ds_l = rows0 + rows1 + ... rows199

score 0 · Answer 1 · edited May 23 '17 at 12:22

First of all, I'm not a Python expert. But since the question has been around for a while already... (At least I'm safe from downvotes if no one looks at this ^^)

1) Use loops, and read a programming tutorial. See for example the post "How do I read a file line-by-line into a list?" on how to get a list of all rows. Then you can loop over the list.
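To make the loop idea concrete, here is a minimal sketch. The helper names job_paths and read_words are made up for illustration, and it assumes the files are named "job (0).txt" up to "job (199).txt" as in the question. It keeps the split(" ") from the question and collects everything in one list, so no rows0 ... rows199 variables are needed.

```python
from pathlib import Path


def job_paths(count, folder="."):
    """Build the file names "job (0).txt", "job (1).txt", ... (pattern taken from the question)."""
    return [Path(folder) / f"job ({i}).txt" for i in range(count)]


def read_words(paths):
    """Read every file, split it on single spaces, and collect all pieces in one list."""
    words = []
    for path in paths:
        text = Path(path).read_text()
        # Same split as in the question; newlines are still attached to the words here.
        words.extend(text.split(" "))
    return words
```

With the 200 files in the current folder, something like ds_l = read_words(job_paths(200)) should replace the whole block of repeated code, including the long sum in the last line.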

2) I have no idea whether it's possible to use regexes with strip. That question is what brought me here, so tell me if you find out.
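For what it's worth, str.rstrip only takes a set of characters to remove, not a pattern. A regex that is anchored at the end of the string can play a similar role. The following is just a sketch; rstrip_pattern is a hypothetical helper name, not something from the standard library.

```python
import re


def rstrip_pattern(text, pattern):
    """Remove repeated matches of `pattern` from the end of `text` (a regex analogue of rstrip)."""
    # \Z anchors the match at the very end of the string.
    return re.sub(rf"(?:{pattern})+\Z", "", text)


# Strip newlines, whitespace and some punctuation from the end of a word.
print(rstrip_pattern("what!\n?", r"[\n?!,\s]"))  # what
```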

It's not clear what exactly you are asking for: do you want to get rid of all (space-separated) words that contain any "\n", or just cut the "\n", "\n?", ... parts out of the words?

In the first case, a simple, inelegant solution would be to loop over the rows and over all words in each row, and do something like

# loop over rows with i as index
for i in range(len(rows)):
    row = rows[i].split(" ")
    # keep only the words without a newline (deleting by index inside the loop would skip items)
    row = [word for word in row if "\n" not in word]
    rows[i] = " ".join(row)

In the latter case, if there aren't many expressions you want to remove, you can probably use re.sub() somehow. Google helps ;)
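A rough sketch of the re.sub() route, under the assumption that "combined with \n" means punctuation such as ? ! . , directly touching the newline. The character set is a guess and would need adjusting to the real data; the function names are made up.

```python
import re

# Assumed pattern: a newline plus any of ? ! . , directly before or after it.
NEWLINE_WITH_PUNCT = re.compile(r"[?!.,]*\n[?!.,]*")


def clean(text):
    """Replace each newline, together with the punctuation touching it, by one space."""
    return NEWLINE_WITH_PUNCT.sub(" ", text)


def clean_words(text):
    """Clean the text, then split on any whitespace so no empty strings are left over."""
    return clean(text).split()


print(clean("hello!\nworld\n?ok"))  # hello world ok
```

Cleaning the whole text before splitting avoids having to strip each word separately afterwards.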

3) If you have the rows as a list "rows" of strings, you can use join to combine them into a single string:

ds_1 = "".join(rows)

(For join, see "Python join: why is it string.join(list) instead of list.join(string)?")
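Note that join gives back one string. If the goal is a single list, as in the last line of the question, the lists can be kept in one outer list and flattened instead. A small sketch with made-up sample data; flatten is a hypothetical helper name:

```python
from itertools import chain


def flatten(list_of_lists):
    """Concatenate all inner lists into one flat list."""
    return list(chain.from_iterable(list_of_lists))


# Stand-in for the per-file word lists (rows0, rows1, ... in the question).
all_rows = [["a", "b"], ["c"], ["d", "e"]]
ds_l = flatten(all_rows)
print(ds_l)  # ['a', 'b', 'c', 'd', 'e']
```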
