extract home sales data from poorly structured table

Question

I was handed a large table of home sales data. In the full table, there are 3548 unique towns, with ~100,000 salespeople. My goal is to find out, approximately, which town has the most homes being sold. Unfortunately the different towns are all stuffed into one field.

+----+--------------+-------------+-------------+-----------------+
| id |  salesperson | salesAmount |  unitsSold  |     town        |
+----+--------------+-------------+-------------+-----------------+
| 1       Joe          245000            4        Brentwood;      |
|                                                 Marksburg;      |
|                                                 Orange Heights; |
+-----------------------------------------------------------------+
| 2       Sally        783290            6        Oak Hills;      |
|                                                 Jacksonville;   |
|                                                 Brentwood;      |
+-----------------------------------------------------------------+
| 3       Bob          398000            2        Eastport;       |
|                                                 Marksburg;      |
+-----------------------------------------------------------------+

Let's say I wanted to find out the approximate amount of Brentwood sales. I can run a query like this:

 SELECT SUM(salesAmount), SUM(unitsSold) from mytable WHERE town LIKE '%Brentwood%';

Which isn't perfect because, for example, we don't know in row 1 exactly how much the Brentwood home sold for. But I still can get a good idea of what towns homes are selling well in.

I'd really like to massage the data to get a table that looks like this:

+----+--------------+-------------------+------------------+
| id |   town       | salesAmountTotal  |  unitsSoldTotal  |         
+----+--------------+-------------------+------------------+
| 1     Brentwood      5,028,290                32         |
|                                                          |
+----------------------------------------------------------+
| 2     Oak Hills      3,783290                 18         |
|                                                          |
+----------------------------------------------------------+
| 3     Eastport       1,398,000                 6         |
|                                                          |
+----------------------------------------------------------+

My boss really doesn't care who sold what, but she does care about in what town it was sold.

How can I pull out the sales data to get a table like the one above?

Any help here greatly appreciated.

you won't be able to a total sales amount for each town, however you might be able to come up with an estimate based on your above and get an average amount — nomistic, May 29 '15 at 23:12

score 1 · Answer 1 · answered May 29 '15 at 23:09

1

There is only one answer. You can't do this, with this structure, because You'll never know how many was selling for exactly one town.

answered May 29 '15 at 23:09

Damonsson

1,532
14
25

the boss knows this. i explained to her the weakness of this data, but i think we can still get a good general idea of which town home are selling well in. – Dave Campagnolo May 29 '15 at 23:21
If You have unitsSold equals number of towns in row it is simple to get. But if You have more, You can only divide unitsSold by any town, and that give You closer reality statistics. – Damonsson May 29 '15 at 23:36

score 1 · Answer 2 · edited May 23 '17 at 11:43

You can split the town names, and determine the number of towns each sales person did their sales in. That way, you can at least calculate the average value of each property sold per person, and the average number or units sold per person.

Those averages can be aggregated by town again, and eventually you get a query like the one below. It's nested, three layers deep, but you can execute each inner query separately to see its results.

I'm afraid those averages are the best you can get. Regardless of how complex you could make the query, there are just no more details left to split.

Technical breakdown

You can find a query below and a link to a fiddle. The query below uses a number of tricks:

Number generator

First of all it generates a list of numbers. You can generate numbers by selecting rows from a table and using a variable to assign a number to each row. If you lack a large table, you can fake one by selecting constant values. In the inner query below, I select 10 values. Then I select 10 values again and cross join those two lists, generating 100 values in total (it's cartesian product, thus, the number of rows multiplied). The number you need to have is the highest number (or more) of towns one sales person has in their list of towns. I think 100 should be enough, but you can make it 1000 by just adding one more cross join.

See this answer for another example.

Splitting the town names

This uses SUBSTRING_INDEX. That function can get the first or last part of a string upto the Nth occurrance of a given delimiter. So, by specifying a string A,B,C,D and the index 2, you get 'A,B,C'.

Then the function is applied again, with a negative index, which gets the last part. That way, you can isolate one town name from the list.

See a random resource about this function.

Getting the number of towns

This is done by comparing the length of the list of towns, minus the length of the same string after the delimiters are removed. In your example data, there is a delimiter at the end of the list, which I assume always to be the case. If not, you need to add 1 (or update the data first, so there always is one, since it would be needed for getting the town name itself too).

See this answer for another example.

The rest

Well, the rest is just some divisions and aggregations. They make the query large, but not much more complex.

select
  splittown as town,
  sum(salesAmountPerTown) as totalSalesAmountPerTown,
  sum(unitsSoldPerTown) as totalUnitsSoldPerTown
from
  (select
    splitSalesPerTown.*,
    -- Estimate of sales per town, is units sold and total amount divided by the number of towns.
    unitsSold / nrOfTowns as unitsSoldPerTown,
    salesAmount / nrOfTowns as salesAmountPerTown
  from
    (select
      s.person,
      -- Just the sales amount
      s.salesAmount,
       -- Just the number of units sold.
      s.unitsSold,
      -- Sales amount divided by units sold = avarage amount per unit
      s.salesAmount / s.unitsSold as salesAmountPerUnit,
      -- Separate the towns using the number generator
      SUBSTRING_INDEX(SUBSTRING_INDEX(s.town, ';', r.row), ';', -1) as splittown,
      -- count the number of towns (assumes there is always an extra ';' at the end of the list)
      LENGTH(s.town) - LENGTH(REPLACE(s.town, ';', '')) as nrOfTowns
    from
      sales s
      -- Very ugly number generator. You could make it simpler if you have a large table to base this upon.
      cross join
        (select @row := @row + 1 as row
        from
          (select 0 union all select 1 union all select 2 union all select 3 union all select 4  union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) a
          cross join
          (select 0 union all select 1 union all select 2 union all select 3 union all select 4  union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) b
          cross join (select @row := 0) i) r
    having
      -- skip those rows that don't have a town name
      splittown <> ''
    ) splitSalesPerTown
   ) splitSalesPerPerson
 group by
   town

The proof is in the fiddle

Other thoughts

Instead of making this one complex query, you may also use a simplified version to get the unique town names and save those in a separate table.

After that, you can use that table to split the sales information per person to separate towns (again, in a separate table).

And finally, you can aggregate that information to get the totals and averages per town, eliminating persons.

That approach might be a little simpler, and you could even decide to download the list of towns or use an external tool to split them instead of a query.

But since I don't know whether it's possible for you to create tables at all, I thought I'd better build one query, if only to show that much is possible if you really want to. ;)

thanks Golez. I've got a lot to work with here. i'll try your suggestions. i've got the data in a table that I have control over. i've also extracted a unique list of towns from the original csv, so I could dump those into a table. — Dave Campagnolo, May 30 '15 at 00:10

Tim3880 · Answer 3 · 2015-05-29T23:20:49.587

0

If you have the table of unique town names, and you just want a rough idea which town is hot, you can try this:

 Select towns.name, (select sum(unitsSold) from mytable where town 
 like '%;'+towns.name+';%' or town like towns.name +';%') AS unitsold from towns

Change the column name if you want to use amount.

edited May 29 '15 at 23:20

answered May 29 '15 at 23:11

Tim3880

2,563
1
11
14

Yes, but not the amount they were sold for. An average might be possible using statistics – nomistic May 29 '15 at 23:17
1

He just wants a good idea. " I still can get a good idea of what towns homes are selling well in." – Tim3880 May 29 '15 at 23:19
i extracted a unique list of the towns, so i guess i could dump them into another table and try your solution. thanks for the idea. – Dave Campagnolo May 29 '15 at 23:22
just make sure your boss understand this is just a "rough idea". There is no way to get accurate numbers with this kind of data. – Tim3880 May 29 '15 at 23:45

extract home sales data from poorly structured table

3 Answers3