2

I'm new to Python from the R world, and I'm working on big text files structured in data columns (this is LiDAR data, so generally 60 million+ records).

Is it possible to change the field separator (e.g. from tab-delimited to comma-delimited) of such a big file without having to read the file and do a for loop on the lines?

Andre Silva
Pierre
  • For what it's worth, if you're on a Linux/UNIX system, this sort of thing may be more easily accomplished with sed: `sed -i 's/\t/,/g' file.csv` (or something like that - don't use this without testing it on a small sample file first). – David Z May 18 '11 at 06:36
  • @David - Something like %$#"! and then -> please do not use this without testing. – Luka Rahne May 18 '11 at 06:59

4 Answers

6

No.

  • Read the file in
  • Change separators for each line
  • Write each line back

This is easily doable with just a few lines of Python (not tested but the general approach works):

# Python - it's so readable, the code basically just writes itself ;-)
#
with open('infile') as infile:
  with open('outfile', 'w') as outfile:
    for line in infile:
      fields = line.split('\t')
      outfile.write(','.join(fields))

I'm not familiar with R, but if it has a library function for this it's probably doing exactly the same thing.

Note that this code only reads one line at a time from the file, so the file can be larger than the physical RAM - it's never wholly loaded in.
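
If some fields might themselves contain commas, a plain split/join would produce an ambiguous output file. A minimal sketch of the same streaming idea using the standard-library `csv` module is below; the function name `tsv_to_csv` and the paths are placeholders rather than anything from the question, and it assumes the input does not use quoting:

```python
import csv

def tsv_to_csv(src_path, dst_path):
    """Stream a tab-delimited file into a comma-delimited one, row by row."""
    # newline='' lets the csv module handle line endings itself
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        # QUOTE_NONE: treat quote characters in the input as ordinary text
        # (assumption: the tab-delimited input is not quoted)
        reader = csv.reader(src, delimiter='\t', quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst, delimiter=',', lineterminator='\n')
        for row in reader:
            # Fields that contain a comma or a quote are quoted on output
            writer.writerow(row)
```

Like the loop above, this holds only one row in memory at a time.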

Eli Bendersky
1

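Actually, let's say yes: you can do it without an explicit loop, e.g.:

with open('in') as infile:
  with open('out', 'w') as outfile:
      # Written for Python 2, where map() is eager; on Python 3 map() is lazy and must be consumed
      map(lambda line: outfile.write(','.join(line.split('\t'))), infile)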
Artem Karp
  • Your code has no effect. [`map`](https://docs.python.org/3/library/functions.html#map) returns an iterator and you don't consume it anywhere. – radzak Apr 06 '18 at 10:54
  • Because I don't need to consume it. Are you familiar with the map function at all, not only in the Python world? – Artem Karp Apr 06 '18 at 11:11
  • Yes, I am, but for example: `map(lambda x: print(x), [0, 1])` is a no-op. – radzak Apr 06 '18 at 11:14
  • Yes, and in Python 2.x it's invalid syntax, or doesn't work, because print is not a function there. Anyway, there is another question: https://stackoverflow.com/questions/7731213/print-doesnt-print-when-its-in-map-python – Artem Karp Apr 06 '18 at 15:20
  • Exactly, glad you found the link. You can fix your code then. – radzak Apr 06 '18 at 15:25
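
As the comments point out, on Python 3 `map` returns a lazy iterator, so nothing is written unless something consumes it. One possible loop-free sketch that does run is to pass a generator expression to `writelines`, which consumes it; the helper name `convert` and its arguments are illustrative only:

```python
def convert(src_path, dst_path):
    """Replace tabs with commas without writing an explicit for statement."""
    with open(src_path) as infile, open(dst_path, 'w') as outfile:
        # writelines() consumes the generator, pulling one line at a time
        outfile.writelines(line.replace('\t', ',') for line in infile)
```

The iteration still happens, just inside `writelines`, so this is a matter of style rather than a way to avoid reading the file.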
1

You can use the Linux `tr` command to replace any single character with any other character, for example `tr '\t' ',' < infile > outfile`.
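
For readers who want to stay in Python, a rough equivalent of that character-for-character replacement is sketched below. It reads the file in fixed-size binary chunks instead of lines; the function name `replace_byte` and the default chunk size are assumptions made for illustration:

```python
def replace_byte(src_path, dst_path, old=b'\t', new=b',', chunk_size=1 << 20):
    """Copy src to dst, replacing every `old` byte with `new`, much like tr."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        # iter() with a sentinel keeps calling read() until it returns b''
        for chunk in iter(lambda: src.read(chunk_size), b''):
            # `old` is a single byte, so a match can never straddle two chunks
            dst.write(chunk.replace(old, new))
```

Because only one chunk is held at a time, memory use stays bounded regardless of file size.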

Jason Sundram
masyanya
0

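You can't, but I strongly advise you to look into generators.

The point is that you can write a faster, well-structured program that does not need to store the whole data set in memory in order to process it.

For instance:

with open("bigfile") as infile, open("outfile", "w") as some_other_file:
    j = (i.split("\t") for i in infile)   # lazily split each line into fields
    s = (",".join(i) for i in j)          # lazily rejoin the fields with commas
    # and now the magic happens: nothing is read until the loop pulls lines through
    for i in s:
        some_other_file.write(i)

This code only uses enough memory to hold a single line at a time.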

Luka Rahne
  • Is there any reason why ``some_other_file.write(",".join(i.split("\t") for i in file))`` would be worse? Also, you must end with ``file.close()``. And it's better to avoid using the name of a built-in identifier, _file_. – eyquem Dec 12 '11 at 18:09