# CIC data processing tutorial

Use the `cic.processor.make_cic_config_template()` function to create a configuration file template, then fill in the necessary information. The configuration file is used when processing the data files.
```python
from cic.processor import make_cic_config_template
make_cic_config_template("/home/user/viikki.yml")
```
Running the above code creates a configuration file template at `/home/user/viikki.yml`. After filling in the information, the configuration file may look like this:
```yaml
measurement_location: Viikki, Helsinki, Finland
id: viikki
description: Agricultural site
instrument_model: CIC-1-1
longitude: 25.02
latitude: 60.23
data_folder:
- /home/user/data/2021
- /home/user/data/2022
processed_folder: /home/user/viikki
database_file: /home/user/viikki.json
start_date: 2022-09-28
end_date: 2022-09-30
inlet_length: 1.0
do_inlet_loss_correction: true
convert_to_standard_conditions: true
allow_reprocess: false
redo_database: false
file_format: block
resolution: 10min
```
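
Before processing, it can help to sanity-check the filled-in configuration. The sketch below is not part of the `cic` package: `check_config` and `REQUIRED_KEYS` are hypothetical names, and the list of required keys is an assumption based on the example above. It works on a plain dictionary, such as the one a YAML parser would return (PyYAML typically loads unquoted ISO dates as `datetime.date` objects, so both dates and strings are accepted).

```python
from datetime import date

# Assumption: these keys are needed for processing (taken from the example above).
REQUIRED_KEYS = ("data_folder", "processed_folder", "database_file",
                 "start_date", "end_date")


def _as_date(value):
    """Accept a datetime.date or an ISO 'yyyy-mm-dd' string."""
    return value if isinstance(value, date) else date.fromisoformat(str(value))


def check_config(config):
    """Return a list of problems found in a configuration dictionary."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS
                if key not in config]
    if problems:
        return problems
    if _as_date(config["start_date"]) > _as_date(config["end_date"]):
        problems.append("start_date is after end_date")
    return problems


# Example configuration mirroring part of the YAML file above
config = {
    "data_folder": ["/home/user/data/2021", "/home/user/data/2022"],
    "processed_folder": "/home/user/viikki",
    "database_file": "/home/user/viikki.json",
    "start_date": date(2022, 9, 28),
    "end_date": "2022-09-30",
}
problems = check_config(config)
print(problems or "Configuration looks consistent")
```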

Then process the data files by running `cic_processor()`:
```python
from cic.processor import cic_processor
cic_processor("/home/user/viikki.yml")
```
```
Added 20220928 to database (Viikki, Helsinki, Finland) ...
Added 20220928 to database (Viikki, Helsinki, Finland) ...
Added 20220928 to database (Viikki, Helsinki, Finland) ...
Processing 20220928 (Viikki, Helsinki, Finland) ...
Processing 20220929 (Viikki, Helsinki, Finland) ...
Processing 20220930 (Viikki, Helsinki, Finland) ...
Done!
```
The code produces daily processed data files named `CIC_yyyymmdd.nc` (netCDF format). These files are saved in the folder given by `processed_folder` in the configuration file.

The locations of the raw and processed files for each day are written to the JSON-formatted `database_file`. This database keeps track of the files and prevents reprocessing in a continuous measurement setting.

  * If `allow_reprocess: false`, only files newer than the newest file in the database are processed.
  * If `allow_reprocess: true`, the processor attempts to process any unprocessed files in the time range.
  * To reprocess everything, use `redo_database: true`; otherwise keep `redo_database: false`.
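
The three rules above can be illustrated with a small sketch. This is a simplified reading of the options, not the package's actual implementation: `days_to_process` is a hypothetical helper, and days are represented as `yyyymmdd` strings so that string comparison matches chronological order.

```python
def days_to_process(available_days, processed_days,
                    allow_reprocess=False, redo_database=False):
    """Pick the days to process (hypothetical helper, simplified logic)."""
    available = sorted(available_days)
    if redo_database:
        # Start over: everything in the time range is processed again
        return available
    done = set(processed_days)
    if allow_reprocess:
        # Any day in the time range that has not been processed yet
        return [day for day in available if day not in done]
    if not done:
        # Nothing processed so far, so every day counts as new
        return available
    # Only days newer than the newest processed day
    newest = max(done)
    return [day for day in available if day > newest]


available = ["20220928", "20220929", "20220930"]
processed = ["20220929"]

default_days = days_to_process(available, processed)
retry_days = days_to_process(available, processed, allow_reprocess=True)
all_days = days_to_process(available, processed, redo_database=True)
```

In this example the default setting picks only the last day, `allow_reprocess=True` also picks up the earlier day that was skipped, and `redo_database=True` selects all three days.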

The netCDF files have the following structure:

| Fields             | Dimensions    | Data type      | Units | Comments                                       |
|--------------------|---------------|----------------|-------|------------------------------------------------|
| **Coordinates**    |               |                |       |                                                |
| time               | time          | datetime64[ns] |       | timezone: utc                                  |
| flag               | flag          | string         |       |                                                |
| **Data variables** |               |                |       |                                                |
| neg_conc_1         | time          | float          | cm-3  | Negative ion number concentration in channel 1 |
| neg_conc_2         | time          | float          | cm-3  | Negative ion number concentration in channel 2 |
| neg_conc_3         | time          | float          | cm-3  | Negative ion number concentration in channel 3 |
| pos_conc_1         | time          | float          | cm-3  | Positive ion number concentration in channel 1 |
| pos_conc_2         | time          | float          | cm-3  | Positive ion number concentration in channel 2 |
| pos_conc_3         | time          | float          | cm-3  | Positive ion number concentration in channel 3 |
| neg_temperature    | time          | float          | K     |                                                |
| pos_temperature    | time          | float          | K     |                                                |
| neg_pressure       | time          | float          | Pa    |                                                |
| pos_pressure       | time          | float          | Pa    |                                                |
| neg_sampleflow     | time          | float          | lpm   |                                                |
| pos_sampleflow     | time          | float          | lpm   |                                                |
| neg_ion_flags      | time,flag     | int            |       | flag=1, no flag=0                              |
| pos_ion_flags      | time,flag     | int            |       | flag=1, no flag=0                              |
| **Attributes**     |               |                |       |                                                |
| Measurement info   |               | dictionary     |       |                                                |

Below is an example of how to access data in a netCDF file.
```python
import xarray as xr

# load the dataset
ds = xr.open_dataset("/home/user/viikki/CIC_20220928.nc")

# Get negative ion number concentration in channel 1
neg_conc_1 = ds.neg_conc_1.to_pandas()

# Close the file
ds.close()
```
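
The flag variables can be used to mask flagged measurements. The sketch below builds a small synthetic dataset with the same layout as the table above, so that it runs without a processed file; the flag names `flag_a` and `flag_b` and all values are made up for illustration. With a real file, `ds` would come from `xr.open_dataset()` instead.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a processed file (values and flag names are invented)
time = pd.date_range("2022-09-28", periods=4, freq="10min")
ds = xr.Dataset(
    data_vars={
        "neg_conc_1": ("time", np.array([100.0, 120.0, 90.0, 110.0])),
        "neg_ion_flags": (("time", "flag"),
                          np.array([[0, 0], [1, 0], [0, 0], [0, 1]])),
    },
    coords={"time": time, "flag": ["flag_a", "flag_b"]},
)

# True for every time step where at least one flag is set
flagged = ds.neg_ion_flags.any(dim="flag")

# Keep unflagged values and replace flagged ones with NaN
clean = ds.neg_conc_1.where(~flagged)
print(clean.to_pandas())
```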

We can combine the previously created files into a single continuous dataset and save the result as a netCDF file.
```python
from pathlib import Path

from cic.utils import combine_data

data_source = Path("/home/user/viikki")
# Collect the daily processed files (CIC_yyyymmdd.nc) in date order
data_files = sorted(data_source.glob("CIC_*.nc"))
date_start = "2022-09-28"
date_end = "2022-09-30"

# Combine the data into a single dataset with 30 min time 
# resolution and flag a data line only if 50% or more 
# of the data inside the 30 min time window contain the flag.
ds = combine_data(data_files, date_start, date_end, "30min",
    flag_sensitivity=0.5)

ds.to_netcdf("combined_cic_dataset.nc")
```
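
To make the `flag_sensitivity` rule concrete, here is a minimal sketch of the idea using pandas. It is not the implementation used by `combine_data`: `downsample_flags` is a hypothetical helper, and the flag values are invented. A coarse time window is flagged when the fraction of flagged samples inside it is at least `flag_sensitivity`.

```python
import pandas as pd


def downsample_flags(flags, window="30min", flag_sensitivity=0.5):
    """Flag a window if enough of its samples are flagged (hypothetical helper)."""
    # The mean of 0/1 flags equals the fraction of flagged samples in the window
    fraction = flags.resample(window).mean()
    return (fraction >= flag_sensitivity).astype(int)


# One hour of 10 min flags: two of three flagged, then one of three flagged
time = pd.date_range("2022-09-28 00:00", periods=6, freq="10min")
flags = pd.Series([1, 1, 0, 0, 0, 1], index=time)

coarse = downsample_flags(flags, "30min", flag_sensitivity=0.5)
print(coarse)
```

The first 30 min window has two of three samples flagged and is therefore flagged; the second has one of three and is not.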