I have a dataset and need to rewrite its contents in a different layout.

My dataset looks like the following (stored in a file named train1.txt):

2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468

I need to convert it to the following layout, with one number per line, and store the result in a new file named train.txt:

2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
And so on for the remaining numbers.

My Python version is 2.7.13 and my operating system is Ubuntu 14.04 LTS. I would appreciate any help. Thank you so much.

Saurabh P Bhandari
  • Hi, welcome to Stack Overflow. I think your task can be done easily, but please provide us with the code that you are working on. – N. Arunoprayoch Jun 17 '19 at 01:06
  • Not quite a dup, but if you're not fussy about the language, [here is how to do it in a Linux/BSD shell](https://stackoverflow.com/a/10758101/1270789). – Ken Y-N Jun 17 '19 at 01:12

3 Answers

I would suggest using regex (regular expressions). This might be a little overkill, but in the long run, knowing regex is super powerful.

import re

def return_no_commas(string):
    # \d+ matches one or more consecutive digits, i.e. one whole number
    regex = r'\d+'
    matches = re.findall(regex, string)
    for match in matches:
        print(match)


numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""

return_no_commas(numbers)

Let me explain what everything does.

import re

imports Python's regular expression module. The regular expression I wrote is

regex = r'\d+'

The "r" prefix makes it a raw string, so the backslash reaches the regex engine unchanged. "\d" matches any single digit, and "+" says it must occur one or more times, so each match is one complete number. (With "*" instead of "+", the pattern could also match zero digits, and re.findall would return empty strings between the numbers, which print as blank lines.) Then we print out all the matches.

I saved your numbers in a string called numbers, but you could just as easily read in a file and work with its contents.

You'll get something like:

2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
2411228
2416802
2322710
2387437
2397274
2344681
2396522
2386676
2413824
2328225
2413833
2335374
2328594
497966
2384001
2372746
2386538
2348518
2380037
2374364
2352054
2377990
2367915
2412520
2348070
2356469
2353541
2413446
2391930
2366968
2364762
2347618
2396550
2370538
2393212
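
Since the question asks for the result in a file rather than on screen, here is a minimal sketch of the same regex idea applied to files. It assumes the file names from the question; the helper names extract_numbers and rewrite_one_per_line are made up for illustration. It uses \d+ (one or more digits) so that no empty matches end up in the output.

```python
import re


def extract_numbers(text):
    """Return every run of digits in `text`, in order of appearance."""
    # \d+ requires at least one digit, so no empty strings are returned.
    return re.findall(r"\d+", text)


def rewrite_one_per_line(src_path, dst_path):
    """Read src_path and write its numbers to dst_path, one per line.

    Both function names are hypothetical; they are not from the answer above.
    """
    with open(src_path) as src:
        numbers = extract_numbers(src.read())
    with open(dst_path, "w") as dst:
        dst.write("\n".join(numbers) + "\n")
    return len(numbers)
```

Calling rewrite_one_per_line("train1.txt", "train.txt") should then produce the layout asked for in the question. Note that this treats any run of digits as a number, so it would not suit data containing signs or decimal points.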
stevestar888

It sounds to me like your original data is separated by commas. However, you want the data separated by new-line characters (\n) instead. This is very easy to do.

def convert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read-from
    wfilename -- name of file to write-to
    """
    assert rfilename != wfilename

    # read the whole input file into a string;
    # the with-block closes the file even if an error occurs
    with open(rfilename, "r") as rfile:
        rstryng = rfile.read()

    lyst = rstryng.split(",")
    # EXAMPLE:
    #     rstryng == "1,2,3,4"
    #     lyst    == ["1", "2", "3", "4"]

    # remove leading and trailing whitespace
    lyst = [s.strip() for s in lyst]

    wstryng = "\n".join(lyst)
    # write() takes a single string; writelines() expects a sequence of lines
    with open(wfilename, "w") as wfile:
        wfile.write(wstryng)


convert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`
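
If the input might end with a trailing comma, splitting on "," alone leaves an empty entry behind, which shows up as a blank line in the output. The following is a rough sketch of one way to guard against that, not part of the answer above; split_fields and convert_file are hypothetical names chosen for illustration.

```python
def split_fields(text):
    """Split comma-separated text into non-empty, stripped fields."""
    fields = []
    for line in text.splitlines():
        for field in line.split(","):
            field = field.strip()
            if field:  # skip blanks left by trailing commas or empty lines
                fields.append(field)
    return fields


def convert_file(rfilename, wfilename):
    """Write each field of rfilename on its own line in wfilename."""
    with open(rfilename) as rfile:
        fields = split_fields(rfile.read())
    with open(wfilename, "w") as wfile:
        for field in fields:
            wfile.write(field + "\n")
    return fields
```

Unlike the join-based version, this writes a newline after every field, including the last one, which most line-oriented tools expect.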
Toothpick Anemone

Since others have added answers, I will include one using numpy. If you are ok using numpy, it is as simple as:

import numpy as np

data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')

If you want a list instead of a NumPy array:

data.tolist()

[2342728,
 2414939,
 2397722,
 2386848,
 2398737,
 2367906,
 2384003,
 2399896,
 ....
]
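
To get the one-number-per-line file the question asks for, the loaded array can be written back out with np.savetxt. This is only a sketch under the assumption that the input holds integers separated by commas; the function name save_one_per_line is invented for illustration.

```python
import numpy as np


def save_one_per_line(src_path, dst_path):
    """Load comma-separated integers and write them one per line.

    Hypothetical helper; not part of the answer above.
    """
    # autostrip=True removes the spaces that follow each comma.
    data = np.genfromtxt(src_path, dtype=int, delimiter=",", autostrip=True)
    # atleast_1d covers a file holding a single number (a 0-d array);
    # ravel flattens the result if the input spans several equal-length rows.
    data = np.atleast_1d(data).ravel()
    # fmt="%d" writes plain integers; savetxt puts each element on its own line.
    np.savetxt(dst_path, data, fmt="%d")
    return data
```

With the file names from the question, the call would be save_one_per_line("train1.txt", "train.txt").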
Unni