2

I am processing a text file which contains up to a thousand lines. There are multiple headers and footers in one text file, so I don't need to process the lines which contain @h and @f; they only mark the beginning and end of a transaction (a database transaction: I will save those records to the DB in one transaction).

A sample record is below. The file can reach up to a thousand lines and each line has up to 40 columns. From each line I am only looking for specific data (e.g. I need to get a name from position 8 to 30, a year from position 60 to 67, and the like). A value might come after a space or between other strings. So I don't want to put each whole line into a buffer/memory to process it, because I am only interested in a few fields. Does a CSV file allow getting data from a specific position in a line? What should I use to get better performance (to process the data as quickly as possible without taking much memory)? I am using Java.

@h Header
@074VH01MATT    TARA   A5119812073921 RONG HI  DE BET IA76200  201108222   0500  *
@074VH01KAYT    DJ     A5119812073921 RONG DED CR BET IA71200  201108222   0500  *
@f Footer

@h Header
@074VH01MATT    TARA   A5119812073921 RONG HI  DE BET IA76200  201108222   0500  *
@074VH01KAYT    DJ     A5119812073921 RONG DED CR BET IA71200  201108222   0500  *
@f Footer
WowBow
  • 7,137
  • 17
  • 65
  • 103
  • Sounds like it would be simpler to just dump it into a database. Even SQLite would do the trick. – Michael Myers Jun 26 '12 at 17:06
  • Finally I will, but before that I need to get those specific positions of data. E.g. from the second line the first name is MATT and the year is 2011, but to get this data I need to process each line and go to a specific position. I know where to go (position 60-67) and so on, but I don't want to take the whole line into memory – WowBow Jun 26 '12 at 17:10
  • Is the position of the data you are parsing fixed? – Dhwaneet Bhatt Jun 26 '12 at 17:11
  • You could read line by line. But think about it: a thousand lines, each maxing at 40 characters, is at max 40K. **Nothing**. The RFID chip on a toothpaste package probably has more memory than that. :-) Just read it all into memory. – user949300 Jun 26 '12 at 17:44
  • Why wouldn't you want to take the whole line into memory? By the time you read the file through the streams and buffers, the line will pretty much already be in memory and trying to avoid that will cause more overhead. – aglassman Jun 26 '12 at 17:45
  • Lol @user949300, you took the words right out of my mouth. – aglassman Jun 26 '12 at 17:46
  • @DhwaneetBhatt .. Yes, it is a fixed position. user94... actually 40 is just the columns; each column can contain up to 40 chars or more, and besides there is big spacing. I know I am thinking too much, but I was looking for a better solution. – WowBow Jun 26 '12 at 18:38
  • @user949300 .loool .. you are right.. I was thinking too much!! – WowBow Jun 26 '12 at 19:01

4 Answers

5

Here is my solution:

import java.io.*;

class ReadAFileLineByLine
{
    public static void main(String args[])
    {
        try {
            FileInputStream fstream = new FileInputStream("textfile.txt");
            BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
            String strLine;
            // Loop through each line; skip header (@h) and footer (@f) lines,
            // otherwise take the substring at the fixed position and print it.
            while ((strLine = br.readLine()) != null) {
                if (!(strLine.charAt(1) == 'h' || strLine.charAt(1) == 'f')) {
                    String tempName = strLine.substring(8, 31);
                    System.out.println(tempName);
                }
            }
            // Close the input stream
            br.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Is something like this what you're looking for?

Failsafe
  • 750
  • 2
  • 7
  • 23
  • +1 for the FileInputStream + DataInputStream combination. Safer way of reading a file than FileReader. See http://stackoverflow.com/a/5155255/470838 for an explanation. – orangepips Jun 26 '12 at 17:35
  • This looks like it would work, but you'd also need to add a line to get the year. Also, to OP, CSV stands for Comma-separated Values, which this file is not; this file is space delimited. – aglassman Jun 26 '12 at 17:43
  • @fgb: DataInputStream is being passed into the InputStreamReader. As for the character encoding, you are correct - so InputStreamReader should be `new InputStreamReader(in, "UTF-8")` or some other expected character set, else it will pick up the system default. – orangepips Jun 26 '12 at 17:53
  • I don't see the use of wrapping the FileInputStream in DataInputStream since it is not used directly. What is the reason for this? – aglassman Jun 26 '12 at 17:56
  • @aglassman: you're correct the DataInputStream is extraneous. I miswrote. Only the FileInputStream + InputStreamReader are needed. – orangepips Jun 26 '12 at 17:59
  • @aglassman - I used it because it's what I do every time, it's really just instinctual now to do it that way... – Failsafe Jun 26 '12 at 18:02
  • @aglassman - Also I was just writing this as an example to grab data from the file, I want him to do some work, not copy paste it... – Failsafe Jun 26 '12 at 18:05
  • Haha, I understand, I was just making sure I wasn't doing it wrong myself. =P – aglassman Jun 26 '12 at 18:07
  • @Failsafe .. yep .. it was something like this .. but you put each line into a (strLine = br.readLine()) .. which I didn't want to do .. I want to go to the specific position without dumping the whole line into a string or memory .. though people are saying not to worry about the memory because it is not that big. Thanks – WowBow Jun 26 '12 at 18:45
  • @Wowbow - You would have to worry about memory if this was 1980. Fortunately for us, we live in an era in which memory allocation(especially in java) is the least of anyone's worries. – Failsafe Jun 26 '12 at 19:11
  • @Failsafe could you please edit it, if we shouldn't use DataInputStream , so that I can accept it as an answer – WowBow Jun 26 '12 at 19:15
4

Use a BufferedReader so it doesn't hold everything in memory, constructed from an InputStreamReader so you can specify the character set (as the JavaDoc for FileReader tells you to do). My example below uses UTF-8, assuming the file is in that encoding.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class StringData {
    public static void main(String[] args) throws Exception {
        BufferedReader br = null;
        try {
            // change this value
            FileInputStream fis = new FileInputStream("/path/to/StringData.txt");
            br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
            String sCurrentLine;
            while ((sCurrentLine = br.readLine()) != null) {
                processLine(sCurrentLine);
            }
        } finally {
            if (br != null) br.close();
        }
    }

    public static void processLine(String line) {
        // skip header & footer
        if (line.startsWith("@h Header") || line.startsWith("@f Footer")) return;

        String name = line.substring(8, 22);
        String year = line.substring(63, 67);

        System.out.println("Name [" + name + "]\t Year [" + year +"]");
    }
}

Output

Name [MATT    TARA  ]    Year [2011] 
Name [KAYT    DJ    ]    Year [2011]
orangepips
  • 9,891
  • 6
  • 33
  • 57
1

I don't think CSV is a must. How are you reading the file, line by line or all at once? I would go with line by line; that way, reading each line is not costly in memory (only one line at a time). You can use a regex on the line and take only the groups you need (with Pattern and Matcher) to extract exactly what you need.
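To make that concrete, here is a minimal sketch of the Pattern/Matcher idea against one of the sample lines above. The column widths (a 14-character name starting at offset 8, a 4-digit year at offset 63) are borrowed from the substring positions in the other answer and are only an assumption; adjust them to the real layout.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {
    // Assumed layout: skip 8 chars, capture a 14-char name, skip 41 more
    // chars, then capture a 4-digit year. Adjust the counts to your file.
    private static final Pattern RECORD =
            Pattern.compile("^.{8}(.{14}).{41}(\\d{4})");

    public static void main(String[] args) {
        String line = "@074VH01MATT    TARA   A5119812073921 RONG HI  DE BET IA76200  201108222   0500  *";
        if (line.startsWith("@h") || line.startsWith("@f")) {
            return; // header/footer line, nothing to extract
        }
        Matcher m = RECORD.matcher(line);
        if (m.find()) {
            System.out.println("Name [" + m.group(1) + "] Year [" + m.group(2) + "]");
        }
    }
}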

wilfo
  • 685
  • 1
  • 6
  • 19
0

Don't worry about memory; you can put the whole file in one char array without anybody noticing. CSV files are a pain and won't do anything for you. Just read each row into a buffer--a String, or char or byte array--and grab from it what you need; the fixed positioning makes it easy.

In general, there's a tradeoff between memory and time. I've found that big buffers, say 100 KB to over 1 MB as opposed to, say, 10 KB, can speed you up 5 to 10 times. (Test it yourself with various sizes if it matters. If I understand you right, you're talking about roughly 40 KB, so there is no need for a buffer bigger than that. If it's 40 MB, then do the tests; even a 40 MB array won't hurt you, but at that point you are starting to waste memory.) Just be sure to close the file and release references to the file class(es) before going on to other work so your buffers etc. are not a memory leak.
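As a rough sketch of that advice (the file name is borrowed from the other answers, and the 1 MB buffer is only an illustrative figure to benchmark against):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class BigBufferRead {
    public static void main(String[] args) throws Exception {
        // The second BufferedReader argument is the buffer size in chars;
        // 1 MB here is only an illustration, measure with your own files.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream("textfile.txt"), "UTF-8"),
                1 << 20)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith("@h") || line.startsWith("@f")) continue;
                // grab the fixed-position fields you need from 'line' here
            }
        } // try-with-resources closes the file, so the buffer can be reclaimed
    }
}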

RalphChapin
  • 3,108
  • 16
  • 18