
I am working on a data analysis project and need to split a raw text dataset (not yet loaded into arrays) into arrays. The data looks like this:

57, Federal-gov, 337895, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 40, United-States, >50K
38, Private, 28887, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K
41, State-gov, 101603, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K

As you can see, the variables/columns are separated by commas. What would be the most efficient way to split this data into separate arrays, so that each column/variable becomes a separate array entry? The code should read the dataset, go through each line, and set a new array entry with the appropriate value. For now I am fine with using a string array. I would also like to remove lines with incomplete information (missing data is marked with a ?). Any help will be appreciated :). If you have any questions, feel free to ask. I am working in Java 1.7. Thanks!

Further information about the database which I am using (if needed): https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names

  • What have you tried and what are you having trouble with? I suggest you read up on how to read a text file line by line and how to split each line by comma. I also suggest using Java 8 as it has a Stream API which helps with data processing (Java 1.7 is an old version) – Peter Lawrey Jul 18 '18 at 17:28
  • By *"database of 'comma' lists"*, it seems that you mean a CSV ([Comma-Separated Values](https://en.wikipedia.org/wiki/Comma-separated_values)) *file*? The best way to read CSV files is to use a [CSV parser](https://stackoverflow.com/q/101100/5221149). – Andreas Jul 18 '18 at 17:34

1 Answer


I won't post the full answer here, because that would amount to doing it for you. Instead, I'll show you the algorithm I'd follow to solve it and share a couple of links to get started.

  1. Read the file line by line. See: How to read line by line by using FileReader
  2. Split each line by comma. See: How to split a comma-separated string?
  3. Map the fields into a class that holds the data with appropriate types (a POJO). You'll need to access each position of the array and convert the value to the correct type (for example, parse numeric fields into an int). https://en.wikipedia.org/wiki/Plain_old_Java_object
  4. Add the POJO to an ArrayList or any other kind of list, or map it by its ID in a HashMap.
  5. Debug, debug and debug. Think of everything that can go wrong... The file might have the wrong format. What if a value contains a comma? What if you're storing the data in a HashMap and the file has a duplicate ID? What if the types are not consistent in the CSV? What if...?
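
To make steps 1 to 4 a bit more concrete, here is a rough, partial sketch rather than a full solution. It sticks to Java 7 features. The class name `AdultCsvSketch`, the `Person` POJO, and the choice to keep only four of the 15 columns are my own assumptions for illustration; the column positions are taken from the sample rows in the question. It also assumes no value contains a comma, which is exactly the kind of thing step 5 asks you to check.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class AdultCsvSketch {

    // Hypothetical POJO: keeps only a few of the 15 columns for brevity.
    static class Person {
        final int age;
        final String workclass;
        final int hoursPerWeek;
        final String income;

        Person(int age, String workclass, int hoursPerWeek, String income) {
            this.age = age;
            this.workclass = workclass;
            this.hoursPerWeek = hoursPerWeek;
            this.income = income;
        }
    }

    // Assumption: every complete row has exactly 15 comma-separated fields.
    static final int EXPECTED_FIELDS = 15;

    // Steps 2 and 3: split one line and map it to a Person.
    // Returns null for blank, malformed, or incomplete ("?") rows.
    // Integer.parseInt throws NumberFormatException on non-numeric input,
    // which is left unhandled here on purpose.
    static Person parseLine(String line) {
        if (line == null || line.trim().isEmpty()) {
            return null;
        }
        String[] fields = line.split(",");
        if (fields.length != EXPECTED_FIELDS) {
            return null;
        }
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
            if (fields[i].equals("?")) {
                return null; // missing value: drop the whole row
            }
        }
        return new Person(
                Integer.parseInt(fields[0]),   // age
                fields[1],                     // workclass
                Integer.parseInt(fields[12]),  // hours per week
                fields[14]);                   // income bracket
    }

    // Steps 1 and 4: read line by line and collect the complete rows.
    static List<Person> readAll(Reader source) throws IOException {
        List<Person> people = new ArrayList<Person>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Person person = parseLine(line);
                if (person != null) {
                    people.add(person);
                }
            }
        }
        return people;
    }

    public static void main(String[] args) throws IOException {
        String sample =
                "57, Federal-gov, 337895, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 40, United-States, >50K\n"
              + "38, ?, 28887, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K\n";
        for (Person p : readAll(new StringReader(sample))) {
            System.out.println(p.age + " " + p.workclass + " " + p.income);
        }
    }
}
```

To read a real file, you could pass `new FileReader("adult.data")` to `readAll` instead of the `StringReader` (the file name here is only a placeholder). If you really want one array per column instead of a list of objects, you can loop over the resulting list afterwards and copy each field into its own array.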

Good luck!

Alex Roig