3

I have a huge (4,5 GB) csv file.. I need to perform basic cut and paste, replace operations for some columns.. the data is pretty well organized.. the only problem is I cannot play with it with Excel because of the size (2000 rows, 550000 columns).

here is some part of the data:

ID,Affection,Sex,DRB1_1,DRB1_2,SENum,SEStatus,AntiCCP,RFUW,rs3094315,rs12562034,rs3934834,rs9442372,rs3737728

D0024949,0,F,0101,0401,SS,yes,?,?,A_A,A_A,G_G,G_G
D0024302,0,F,0101,7,SN,yes,?,?,A_A,G_G,A_G,?_?
D0023151,0,F,0101,11,SN,yes,?,?,A_A,G_G,G_G,G_G

I need to remove 4th, 5th, 6th, 7th, 8th and 9th columns; I need to find every _ character from column 10 onwards and replace it with a space ( ) character; I need to replace every ? with zero (0); I need to replace every comma with a tab; I need to remove first row (that has column names; I need to replace every 0 with 1, every 1 with 2 and every ? with 0 in 2nd column; I need to replace F with 2, M with 1 and ? with 0 in 3rd column;

so that in the resulting file the output reads:

D0024949 1 2 A A A A G G G G

D0024302 1 2 A A G G A G 0 0

D0023151 1 2 A A G G G G G G

(both input and output should read one line per row, ne extra blank row) Is there a memory efficient way of doing that with java(and I need a code to do that) or a usable tool for playing with this large data so that I can easily apply Excel functionality..

David Basarab
  • 72,212
  • 42
  • 129
  • 156
notapipe
  • 31
  • 2
  • 1
    you could consider tools like [sed and awk](http://www.faqs.org/docs/abs/HTML/sedawk.html) – miku Jun 02 '10 at 13:10
  • 1
    Why not simply read a row of data, parse as comma delimited, perform desired actions, write the new comma delimited line to a new file, go back to step 1. If your data is clean then you could simply split the line on the comma or parse with basic csv rules or even a regex might work. I don't see what's so difficult? –  Jun 02 '10 at 13:26

1 Answers1

1

You need two things:
- Knowledge of Regular Expressions (aka Regex, Regexes)
- PowerGrep

PeterM
  • 1,188
  • 1
  • 10
  • 15